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Project  Summary 


The  principal  focus  of  this  project  was  on  developing  new  statistical  algorithms  for 
analysis  of  electromagnetic  induction  (EMI)  and  magnetometer  data  measured  at  actual 
former  bombing  ranges.  To  address  this  challenge,  four  key  technologies  have  been 
developed: 


•  Active  learning:  Active  learning  is  a  framework  whereby  the  labeled  (training) 
data  used  to  design  a  classifier  are  defined  in  situ  directly  on  the  site  under  test. 
This  is  performed  using  information-theoretic  measures,  in  which  one  asks  which 
signatures  from  the  site  under  test  would  be  most  infonnative  for  classifier  design 
if  the  associated  labels  could  be  acquired.  These  labels  are  then  acquired  via 
excavation.  In  this  framework  one  assumes  no  a  priori  labeled  (training)  data,  and 
the  initial  phase  of  excavation  is  performed  for  the  purpose  of  learning  an 
algorithm.  Once  the  information-theoretic  measure  indicates  that  no  further 
information  is  available  from  further  excavations,  the  excavations  for  the  purpose 
of  learning  are  terminated.  At  this  point,  using  the  acquired  labeled  data,  a  dig  list 
is  provided,  defining  an  ordered  list  of  which  items  are  most  likely  to  be  UXO. 
Since  the  output  of  this  process  is  defined  in  terms  of  the  probability  of  being 
UXO,  a  risk-based  analysis  may  also  be  employed,  to  define  when  to  stop 
excavating  (when  the  risk  analysis  indicates  that  the  expected  cost  is  minimized 
by  leaving  the  remaining  items  unexcavated). 

•  Concept  drift:  In  the  above  discussion  it  was  assumed  that  no  labeled  (training) 
data  were  available.  However,  in  practice  one  may  have  labeled  data,  for  example 
from  previous  excavations  at  different  former  ranges.  The  challenge  is  that  the 
data  from  a  previous  site  may  not  be  entirely  relevant  for  the  new  site  under  test. 
Speaking  in  a  statistical  sense,  the  site  under  test  may  be  characterized  as  one 
“concept”,  and  there  may  be  a  “concept  drift”  from  this  new  site  relative  to  data 
from  previous  sites  analyzed  previously.  Therefore,  the  objective  is  to  learn  the 
“concept  drift”  between  the  site  of  interest  and  previous  sites,  and  once  this 
statistical  relationship  is  understood/leamed,  one  may  appropriately  utilize 
previous  existing  labeled  data.  In  concept-drift-based  learning,  one  again  employs 
active  learning,  whereby  an  initial  set  of  excavations  are  performed  for  the 
purpose  of  learning.  However,  in  this  context  one  is  interested  in  learning  the 
inter-relationship  between  the  current  site  under  test  and  previous  labeled  data 
sets  that  are  available.  The  final  dig  list,  after  excavation  for  the  purpose  of 
learning,  utilizes  all  of  the  excavated  labeled  data  from  the  site  of  interest,  as  well 
as  an  appropriately  weighted  version  of  the  labeled  data  from  previous  sites.  In 
this  sense  the  algorithm  learns  to  appropriately  utilize  all  available  data. 

•  Semi-supervised  learning:  Most  existing  classification  algorithms  are 
supervised.  This  implies  that  a  set  of  labeled  data  are  provided  to  the  classifier, 
from  which  learning  is  constituted.  The  classifier  so  learned  is  then  applied  one  by 


one  to  each  of  the  unlabeled  signatures,  for  which  a  classification  (UXO/non- 
UXO)  is  desired.  In  UXO  sensing  one  typically  has  access  to  all  of  the  unlabeled 
data  simultaneously  (since  all  of  the  data  are  collected  at  once,  typically). 
Therefore,  there  is  an  opportunity  to  place  the  classification  of  any  given 
signature  within  the  context  of  all  other  unlabeled  signatures  (since  all  unlabeled 
data  are  available  simultaneously).  When  one  takes  account  of  the  properties  of 
the  unlabeled  data  when  learning  a  classifier,  this  is  tenned  semi-supervised 
learning.  Duke  has  developed  novel  semi-supervised  algorithms,  and  applied 
them  successfully  to  measured  UXO  data. 

•  Sensor  management:  The  active-learning  and  concept-drift  algorithms  address 
the  issue  of  what  is  termed,  in  the  statistics  community,  “incomplete”  or 
“missing”  data.  Specifically,  the  unlabeled  data  from  a  given  site  of  interest  is 
“missing”  labels,  and  in  that  sense  it  is  “incomplete”.  The  active-learning 
framework  seeks  to  fill  in  missing  data  in  an  optimal  manner,  with  targeted 
(information-theory-based)  excavations,  for  the  purpose  of  learning  the  statistical 
characteristics  of  the  new  site  under  test.  Another  form  in  which  missing  data  is 
manifested  is  if  only  a  subset  of  sensors  is  deployed  in  a  given  region.  For 
example,  assume  that  EMI  and  magnetometer  sensors  are  available.  A  given 
portion  of  a  site  may  be  interrogated  by  EMI,  another  region  by  magnetometer, 
and  a  third  by  both  of  these  sensors.  There  is  “missing  data”  in  those  regions  for 
which  only  one  of  the  two  sensors  is  deployed.  Similar  issues  may  exist  for  the 
spatial  sample  rate  of  the  data,  even  for  a  single  sensor:  If  data  are  sampled  in 
spatial  increments  Ax,  there  is  missing  data,  for  example,  at  finer  sample  rates. 
We  may  again  employ  information  theory  to  optimally  deploy  additional  sensors, 
with  the  purpose  of  optimally  completing  the  data,  and  deployment  of  additional 
sensors  is  terminated  when  the  expected  information  gain  saturates.  This 
framework  has  been  developed  for  multi-sensor  UXO  sensing. 

Publications 

The  research  reported  here  has  resulted  in  several  publications,  in  leading  journals.  It  has 
also  been  published  at  the  foremost  conferences  in  machine  learning  (/.<?.,  many  of  the 
ideas  developed  here  are  novel  at  the  basic  statistics/machine-learning  level,  beyond  the 
specific  applicability  for  UXO).  As  requested  by  SERDP,  we  have  recently  written 
papers  that  are  targeted  directly  to  the  UXO  community,  to  improve  the  accessibility  of 
the  ideas  we  have  developed.  Those  papers  are  appended  here  as  a  part  of  this  final 
report.  The  specific  papers  are: 

•  Y.  Zhang,  X.  Liao  and  L.  Carin,  “Detection  of  buried  targets  via  active  selection 
of  labeled  data:  application  to  sensing  subsurface  UXO,”  IEEE  Trans.  Geosc. 
Remote  Sensing,  vol.  42,  pp.  2535-2543,  Nov.  2004. 

•  D.  Williams,  C.  Wang,  X.  Liao  and  L.  Carin,  “Classification  of  unexploded 
ordnance  using  incomplete  multi-sensor  multiresolution  data,”  accepted  for 
publication  in  IEEE  Trans.  Geoscience  &  Remote  Sensing 


•  Q.  Liu,  X.  Liao  and  L.  Carin,  “Detection  of  unexploded  ordnance  via  efficient 
semi-supervised  and  active  learning,”  submitted  to  IEEE  Trans.  Geoscience  & 
Remote  Sensing 

•  X.  Liao  and  L.  Carin,  “Migratory  logistic  regression  for  learning  concept  drift 
between  two  data  sets:  Application  to  UXO  sensing,”  submitted  to  IEEE  Trans. 
Geoscience  &  Remote  Sensing 


Transitions 


Many  of  the  ideas  developed  under  this  SERDP  project  are  being  utilized  within  the 
context  of  an  ESTCP  project  being  executed  by  Signal  Innovations  Group,  Inc.,  where 
they  are  being  demonstrated  on  actual  sites. 


Detection  of  Buried  Targets  via  Active  Selection  of  Labeled  Data: 
Application  to  Sensing  Subsurface  UXO 

Yan  Zhang,  Xuejun  Liao  and  Lawrence  Carin 
Department  of  Electrical  and  Computer  Engineering 
Duke  University 
Box  90291 

Durham,  NC  27708-0291 

Abstract  -  When  sensing  subsurface  targets,  such  as  landmines  and  unexploded 
ordnance  (UXO),  the  target  signatures  are  typically  a  strong  function  of  environmental 
and  historical  circumstances.  Consequently,  it  is  difficult  to  constitute  a  universal 
training  set  for  design  of  detection  or  classification  algorithms.  In  this  paper  we  develop 
an  efficient  procedure  by  which  information-theoretic  concepts  are  used  to  design  the 
basis  functions  and  training  set,  directly  from  the  site-specific  measured  data. 
Specifically,  assume  that  measured  data  (e.g.,  induction  and/or  magnetometer)  are 
available  from  a  given  site,  unlabeled  in  the  sense  that  it  is  not  known  a  priori  whether  a 
given  signature  is  associated  with  a  target  or  clutter.  For  N  signatures  the  data  may  be 
expressed  as  {x(  ,  y(  }(=l  /v  ,  where  X;  is  the  measured  data  for  buried  object  i  and  y,  is  the 

associated  unknown  binary  label  (target/non-target).  Let  the  N  x,  define  the  set  X.  The 
algorithm  works  in  four  steps:  (i)  The  Fisher  information  matrix  is  used  to  select  a  set  of 
basis  functions  for  the  kernel-based  algorithm,  this  step  defining  a  set  of  n  signatures 
B;I  c:  Xthat  are  most  informative  in  characterizing  the  signature  distribution  of  the  site; 

(ii)  the  Fisher  information  matrix  is  used  again,  to  define  a  small  subset  X  v  c:  X , 

composed  of  those  x,  for  which  knowledge  of  the  associated  labels  y,  would  be  most 
informative  in  defining  the  weights  for  the  basis  functions  in  B„;  (iii)  the  buried  objects 
associated  with  the  signatures  in  Xv  are  excavated,  yielding  the  associated  labels  y„ 
represented  by  the  set  YiS;  and  (iv)  using  B„,  Xv  and  Yv  a  kernel-based  classifier  is 
designed,  for  use  in  classifying  all  remaining  buried  objects.  This  framework  is  discussed 
in  detail,  with  example  results  presented  for  an  actual  buried-UXO  site. 
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I.  INTRODUCTION 

It  is  well  known  that  sensor  signatures  of  buried  targets  such  as  landmines  and 
unexploded  ordnance  (UXO)  are  a  strong  function  of  their  history  and  soil  environment. 
For  example,  radar  and  seismic  sensing  of  landmines  is  a  strong  function  of  the  soil 
properties  [1].  Electromagnetic  induction  (EMI)  and  magnetometer  [2,3]  sensors  are 
typically  less  sensitive  to  soil  properties  when  the  target  is  of  high  metal  content,  such  as 
UXO.  However,  the  complexity  of  the  UXO  sensing  problem  is  strongly  influenced  by 
which  ordnance  are  present,  on  how  the  ordnance  impacted  the  soil,  and  on  the 
surrounding  man-made  conducting  clutter  and  UXO  fragments.  All  of  these  issues  are 
dependent  on  the  history  of  a  given  UXO  site. 

These  characteristics  of  the  subsurface-sensing  problem  significantly  complicate 
design  of  detection  and  classification  algorithms,  since  it  is  difficult  to  define  a  set  of 
landmine  or  UXO  sensor  signatures  that  are,  for  algorithm- training  purposes  [4], 
generally  representative  (for  all  landmine  and  UXO  sites).  In  this  paper  we  investigate  a 
technique  whereby  detection  and  classification  algorithms  may  be  designed  for  sensing 
buried  landmines  and  UXO  without  requiring  a  separate  training  set  of  representative 
target  and  clutter  signatures.  The  approach  is  based  on  the  realization  that,  when  sensing 
landmines  and  UXO,  one  will  eventually  excavate  buried  targets  based  on  the  sensor 
data.  The  approach  developed  here  chooses  which  items  to  excavate  initially,  based  on 
their  importance  in  design  of  the  associated  detection  and  classification  algorithm. 

Let  [x;},= i,/v  represent  the  known  measured  signatures  of  the  N  subsurface  objects 
at  a  given  site,  with  the  set  of  all  x,  denoted  as  X.  Further,  let  {>’(},=i,/v  represent  the 
associated  unknown  binary  labels  (target/non-target)  of  the  signatures,  to  be  determined 
in  the  detection  phase.  We  here  develop  a  kernel-based  classifier,  by  which  an  observed 
signature  or  feature  vector  x  is  classified  using  the  function 

f(x)  =  fjwiK(x,bi)  +  w0  (1) 

1=1 

where  b,  is  the  ith  basis  function,  w,  are  scalar  weights,  wa  is  a  scalar  offset  or  bias,  and 
X(x,bi)  is  a  general  kernel  that  defines  the  similarity  of  x  to  b,.  For  a  prescribed  threshold 
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t,  x  is  deemed  associated  with  the  +1  class  if  /(x)  >  t ,  and  associated  with  the  -1  class  if 
f(x)<t,  and  by  varying  the  threshold  !  one  yields  the  receiver  operating  characteristic1 
(ROC)  [4].  Algorithms  that  utilize  the  form  in  (1)  include  the  support  vector  machine 
(SVM)  [5-6],  the  relevance  vector  machine  (RVM)  [7],  kernel  matching  pursuits  (KMP) 
[8],  as  well  as  many  other  related  algorithms  [9-11]. 

In  design  of  a  classifier  of  the  form  in  (1),  the  b,  typically  come  from  a  separate 
training  set,  for  which  the  associated  labels  y,  are  known.  In  this  case  the  goal  is  to  design 
a  classifier  of  the  form  in  (1)  that  correctly  identifies  the  labels  of  the  training  data  (when 
x=b,  for  any  i),  with  the  hope  that  this  will  generalize  to  data  x  not  observed  while 
training.  The  aforementioned  variability  of  subsurface-target  signatures  makes  the  idea  of 
utilizing  a  separate  training  set  undesirable  and  often  impractical. 

In  the  approach  taken  in  this  paper,  the  set  of  basis  functions  B(I  =  {b,],=1„is 
selected  from  the  set  of  observed  data  X,  i.e.  BncX.  This  is  done  because  such  basis 
functions  will  be  well  matched  to  the  data  to  be  classified,  vis-a-vis  other  data  that  may 
have  come  from  a  different  site.  The  set  B  is  defined  by  selecting  those  signatures  from 

X  that  are  most  representative  of  the  measured  data  from  the  site  of  interest,  using 
fundamental  information-theoretic  considerations  to  be  detailed  below.  Note  that  the 
labels  (identities)  of  the  subsurface  objects  associated  with  B(1  are  not  required  at  this 
point.  Having  defined  the  basis  set  for  (1),  we  must  determine  the  associated  model 
weights  {w,.}/=1(,  and  w0  (denoted  collectively  by  the  vector  w),  and  for  this  task  we 

require  labeled  data.  Therefore,  we  define  a  subset  of  signatures  Xs  cz  X  for  which 

knowledge  of  the  associated  labels  Lv  would  be  most  informative  in  the  context  of 
defining  the  model  weights.  The  set  of  signatures  X.v  is  again  determined  via  information- 
theoretic  metrics  detailed  below.  Note  that  the  sets  Bn  and  X.s  may  overlap,  but  they  are  in 
general  distinct.  After  excavating  the  items  associated  with  X.s,  yielding  Ls,  the  algorithm 

1  Rigorously  speaking,  the  ROC  was  originally  developed  for  a  likelihood-ratio  test  [4],  and  therefore  what 
we  consider  here  is  arguably  ROC-like.  For  notational  convenience,  throughout  we  refer  to  such  as  an 
ROC. 
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in  (1)  is  trained  as  usual  [8]  and  then  applied  to  x  g  Xs .  The  key  point  is  that  the  training 

set  (XV,LS)  is  determined  adaptively  on  the  observed  site-dependent  data,  via  fundamental 
information-theoretic  metrics. 

We  demonstrate  using  measured  EMI  and  magnetometer  data  from  an  actual 
UXO  site  that  the  sets  Xs  and  Ls  are  often  of  small  dimension,  thereby  minimizing  the 
amount  of  excavation  required  for  algorithm  design.  Once  the  algorithm  has  been 
designed,  the  fact  that  it  is  well  matched  to  the  environment  often  yields  a  significant 
reduction  in  the  false-alarm  rate,  thereby  ultimately  reducing  the  total  number  of 
excavations  (i.e.  the  false-alarm  probability  is  reduced,  and  therefore  less  excavation  is 
required  of  clutter). 

The  remainder  of  the  paper  is  organized  as  follows.  In  Sec.  II  we  discuss  the 
selection  of  basis  functions  B(1  and  labeled  data  Xv.  We  present  in  Sec.  Ill  example 

results  on  EMI  and  magnetometer  detection  of  UXO  from  an  actual  UXO  site.  The  work 
is  summarized  in  Sec.  IV. 

II.  ACTIVE  CLASSIFIER  DESIGN 


A.  Model  structure 

The  decision  function  in  (1),  using  n  basis  functions,  may  be  expressed  concisely 

as  [8] 

/„(x)  =  J  wniK(x,b,)  +  wn0  =  w>„(x)  (2) 

i= 1 

where 

<h,(x)  =  [1,  K(  x,b1),X(x,b2),...,X(  x,b„)]r  (3) 

W„  =  [Wn,0’Wn,l’  K,2’  •••’  WnJ  (4) 

Assume  that  the  basis  set  B;1  ={b1,b2,...,bn}  is  known.  Moreover,  assume  that  the  item 
associated  with  signature  x,  is  excavated  (this  is  termed  an  “experiment”),  from  which  we 
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learn  the  associated  label  y{,  where  by  construction  y,=  l  for  one  class  and  y,=-l  for  the 
other  class  (target/no-target).  The  label  found  by  the  experiment  is  related  to  the 
prediction  /„(x)  by 

yt  =w>B(xI.)  +  eI.  (5) 

where  s(x;)  is  the  error  term  resulting  from  imperfections  in  the  model.  In  algorithm 

design  one  desires  weights  w  that  minimize  the  error  observed  on  training  data,  for  which 
the  data  and  labels  are  known.  If  the  training  data  are  well  matched  to  the  subsequent 
testing  data,  then  the  algorithm  is  likely  to  constitute  a  robust  detection  procedure.  As 
indicated  above,  in  many  subsurface-sensing  problems  it  is  impractical  to  have  a  separate 
training  set,  with  this  addressed  by  the  information-theoretic  techniques  discussed  below. 

B.  Selection  of  basis  functions 

If  we  assume  that  the  s,  in  (5)  are  Gaussian  and  independent  with  variance  a,2 ,  then 
the  Fisher  information  matrix  associated  with  X  and  B  is  expressed  as  [12] 


m  =  E",  <Vi,  (*,  x  (*, ) = Z“,  A1., 


(6) 


where  (j)ni  =  ()>„(x,) .  Note  that  in  computing  M„  we  do  not  require  the  labels  associated 

with  B„  and  X  (this  is  a  result  of  the  fact  that  the  model  in  (2)  is  linear  in  the  weights  w„). 
As  discussed  by  Fedorov  [12],  the  Fisher  information  matrix  in  (6)  is  associated  with  the 
uncertainty  in  the  model  weights  w,  as  defined  through  all  N  measured  x,  and  the  basis 
B„.  By  appending  a  new  basis  function  to  (]),,(•) ,  one  obtains 


4>»+iO 


<>„(•) 

4>b+i(0 


(7) 


where  <|)n+1(-)  =  K(  ■  ,b(I+1)  and  b„+1  eX,  b„+1  g  B„ .  Following  (2),  we  can  write  from 
<|)f!+1  the  augmented  classifier  fn , , ,  for  which  the  Fisher  information  matrix  is  found  to  be 
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where  cj))i+1(  =  (j)n  M  (x() .  The  expression  in  (8)  is  again  associated  with  fitting  the  model  to 

the  N  measured  x,-,  but  now  using  an  (n+l)-dimensional  basis  B„+i,  vis-a-vis  the  n- 
dimensional  basis  B„  in  (6).  We  develop  a  metric  which  compares  (6)  and  (8),  thereby 
quantifying  the  information  gain  by  adding  the  new  basis  b„+i. 


There  are  many  ways  of  comparing  the  information  content  reflected  by  M„  and 
M„+i,  and  here  we  employ  the  so-called  D-optimal  procedure  [12],  defined  as  the 
determinant  of  the  information  matrix.  The  logarithm  of  the  determinant  of  M  is  denoted 
qn,  and  it  may  be  shown  that 

<7n+i  =  <ln  +lnr(b„+i)  (9) 

where 


Kb„+i)  =  £iO[2Cu 


Since  N  >  n  the  matrix  M„  is  full  rank  and  its  inverse  exists  (assuming  that  n  of  the 
vectors  {<])„  (x, )  };=1  N  are  linearly  independent).  Under  these  conditions,  it  can  be  shown 


that  r  >  0,  and  therefore  lnr  in  (9)  is  generally  valid.  Note  that  if  r(b,!+i)=0,  then  M„+i  is 
rank-deficient,  and  we  delete  the  basis  function  from  the  candidate 
set  and  proceed  to  select  from  the  remaining  ones  (however,  we  haven’t  found  r(b„+i)=0 
in  practice). 


It  is  known  from  information  theory  [13]  that  the  inverse  of  Mn  gives  the  Cramer- 
Rao  lower  bound  (CRLB)  of  the  covariance  matrix  of  the  estimate  of  wn  and  the 

reciprocal  of  qn  lower  bounds  the  product  of  its  eigenvalues.  The  CRLB  is  here  the 
actual  covariance,  assuming  the  Gaussian  model.  A  larger  qn  implies  low  variances  of  the 
components  of  w„ .  Given  the  nth  order  decision  function  fn ,  qn  is  fixed,  and  one  relies 

on  maximization  of  lnr(b)I+1)  to  obtain  a  large  value  of  qn+\.  This  can  be  achieved  by 
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conducting  a  “greedy”  search  for  the  new  b)I+1  in  X  with  the  previously  selected  support 
data  excluded 

b„+1  =  argmaxbeX  bgB  lnr(b)  (11) 

Using  the  procedure  outlined  above,  basis  elements  b„  are  added  until  the  information 
gain  reflected  in  qn+i-qn  is  no  longer  deemed  significant.  Note  from  (9)-(10)  that 
evaluation  of  (11)  does  not  require  knowledge  of  the  target  labels  y-„  and  therefore  no 
excavation  is  required  to  determine  the  basis  B„.  The  greedy  method  is  suboptimal,  but  in 
practice  often  provides  good  results. 

C.  Selection  of  labeled  data,  for  model  training 

Assume  that  the  procedure  discussed  above  selects  n  bases  from  the  observed  data 
X.  We  now  require  labeled  data  to  optimize  the  associated  model  weights  w.  In  a  manner 
analogous  to  the  previous  discussion,  we  select  those  x;  e  X  for  which  knowledge  of  the 
associated  labels  y,  would  be  most  informative  in  the  context  of  defining  w.  Those  x;  that 
are  so  selected  define  a  subset  of  signatures  Xs  cz  X,  and  these  items  are  excavated  to 

yield  the  respective  set  of  labels  Lv.  The  set  of  signatures  and  labels  (Xv,  Lv)  are  then  used 
to  define  the  weights  w  in  a  least-squares  sense,  and  the  resulting  model  fix)  is  used  to 
specify  which  of  the  remaining  signatures  x  g  Xs  are  likely  targets  of  interest. 


Assume  that  there  are  J  signatures  in  Xv,  denoted  XSiJ.  We  quantify  the 
information  context  in  Xsj  in  the  context  of  estimating  the  model  weights  w,  and  further 
ask  which  x,  g  XJ  y  would  be  most  informative  if  it  and  its  label  were  added  for 

determination  of  w.  Analogous  to  (6),  we  have 


M,(Xm)  =  2„.x  V?*.M 


(12) 


The  expressions  in  (6)  and  (12)  both  employ  an  n-dimcnsional  basis  set  Bn  c  X .  The 

distinction  is  that  in  (6)  we  are  interested  in  defining  B,„  and  we  sum  over  all  observed 
signatures  { x, }  ,=i By  contrast,  in  (12)  the  basis  set  B„  is  known  and  fixed,  and  we  are 
only  summing  over  those  signatures  Xsj  for  which  knowledge  of  the  associated  labels  is 
most  informative  in  defining  the  model  weights  w. 
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After  adding  a  new  signature  x,.  e  X ,  x;  g  Xs , ,  we  now  have  X 


and  M„  is 


s,J  + 1 


updated  as 


Mn(Xs,7+1)=M„(Xj,/)  +  a:2+i(|,,I,.+i<.+i 


(13) 


where  ij+\  represents  the  index  of  the  new  signature  selected  for  X,,y+i.  Using  the  matrix 
identity  det(A+FFr)=det(I+FrA_1F)det(A),  one  obtains  from  (13) 


Cln(X  s,J+l 


)  =  <ln(Xs,j)  +  lnP(XiJJ 


(14) 


with 


p(x.J  =  l  +  a:;+iC.;iM;1(X^)(|,,,.r 


(15) 


Care  is  needed  with  regard  to  evaluating  the  inverse  of  M,„  since  if  J<n  the  matrix  is  rank 
deficient.  We  have  considered  addressing  this  in  either  of  two  ways.  A  standard  approach 
for  inversion  of  such  matrices  is  to  add  a  small  diagonal  term  to  M„,  such  that  its  inverse 
exists.  Alternatively,  by  construction  one  can  assume  that  the  items  associated  with  the 
basis  B„  are  all  associated  with  Xsj,  yielding  a  minimum  of  n  labeled  data  and  therefore 
assuring  that  the  matrix  is  full  rank.  We  have  examined  both  procedures,  and  they  yield 
comparable  results.  We  use  the  second  approach  in  all  examples  presented  in  Sec.  III. 


Having  addressed  the  inverse  of  M,„  one  iteratively  maximizes  lnp(x(.  )  to 

obtain 

x,,+1  =  argmaxxeX  xgX  j  lnp(x)  (16) 

Note  that  to  define  x,  we  again  do  not  require  the  signature  labels.  The  elements  of  Xv 

are  selected  iteratively,  in  a  “greedy”  fashion  as  indicated  in  (16),  until  the  information 
gain  is  below  a  prescribed  threshold.  After  J  iterations  we  have  defined  those  signatures 
Xsj  for  which  knowledge  of  the  labels  will  best  approximate  the  weights  w.  These  items 
are  excavated,  yielding  the  labels  LsJ . 

For  the  assumptions  underlying  the  linear  model  in  (5),  and  assuming  knowledge 
of  B„  and  (Xsj,  LVij)  the  optimal  estimation  for  the  weights  w  is  expressed  as  [8,12] 

w  =  [a>Ta>]  o>Ty 
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(17) 


where  y  represents  the  set  of  labels  determined  via  the  J  excavations 

y  =  {)v)v- 

and  the  J  x(n  + 1)  matrix  d>  is  defined  as 

<(x(1) 
f'(x,2) 

<D  = 

where,  for  example,  x,  corresponds  to  yh  . 

In  the  classification  stage  we  consider  x?XsJ  and  compute /(x).  For  a  prescribed 
threshold  t,  x  is  deemed  associated  with  the  +1  class  if  /(x)  >  1 ,  and  associated  with  the  - 
1  class  if  and  by  varying  the  threshold  1  one  yields  the  receiver  operating 

characteristic  (ROC)  [4].  The  key  component  of  the  model  /(x)  is  that  it  is  linear  in  the 
weights  w,  which  yields  a  closed- form  procedure  for  selection  of  B„  and  X.v, j,  as  indicated 
in  the  previous  sections. 

III.  THEORETICAL  MOTIVATION  FOR  CLASSIFIER  DESIGN 

In  the  previous  two  sections  we  have  presented  procedures  for  selecting  basis  functions 
for  a  kernel-based  classifier,  based  on  a  set  of  unlabeled  data.  After  designing  the  basis 
set,  we  have  also  addressed  selection  of  which  signatures  would  be  most  informative  for 
classifier  training,  if  the  associated  signature  labels  were  known.  In  this  section  we 
provide  theoretical  justification  for  these  design  procedures,  and  in  Sec.  IV  example 
results  are  presented  for  UXO  sensing. 


(18) 


(19) 
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A.  Basis-function  selection 


To  simplify  notation,  we  utilize  matrix  expressions  in  our  derivation.  Let  the  basis 
functions  <)>„(•)  be  evaluated  for  all  initially  unlabeled  data  points  { x,  }I=i.,v  ,  and  stacked 

to  form  the  matrix  =  [<()),(x1  ),§?I (x2 (|)n(xA,)]T .  Let  the  data  labels  be  denoted 


v  =  {_vl,y2,...,y/v}1  ,  although  these  labels  are  not  required  when  designing  the  basis 

functions.  The  difference  between  the  true  labels  and  those  output  by  the  classifier  (2), 
for  all  {x;},=i^  is  expressed  in  vector  form  as 

1 


*(IN-®n(Un+1+®Tn®nyl®Tn)y  (20) 


where  is  a  N  x  N  identity  matrix  (I„  is  defined  similarly),  and  A,  is  a  small  positive 

number.  The  equality  3  in  (20)  is  due  to  the  Sherman-Morrison-Woodbury  formula.  From 
(20)  the  squared  error  between  the  true  and  estimated  labels  is 


yT(i,v+i®>T2y 


(21) 


The  expression  in  (21)  shows  that  for  given  basis  functions  <()„(•)  we  have 
approximately  expressed  the  squared  error  as  a  quadratic  form  of  the  labels  y ,  with  a 
coefficient  matrix  C“2  with  Cn  =  IN  +  /X .  The  approximation  can  be  made  as 
accurate  as  desired  by  making  X  sufficiently  small.  Without  knowing  y  ,  we  prefer  C„  to 
have  large  eigenvalues,  to  make  the  error  en  small.  This  is  accomplished  by  making  the 
determinant  of  C„  large.  The  logarithmic  determinant  of  C„  is 

1 


<Z«2)  =lndet(C,!)  =  lndet(IJV  +^-OfiO^) 


X 


iln'kW.-i 


n+ 1 


=  ln 


X 

det(XI„+1+M„) 


(22) 


X 


n+ 1 


to 


where  equality  3  is  due  to  the  property  of  matrix  determinants  and  equality  4  is  due  to  (6). 
Adding  a  new  basis  function  to  <)>„(•),  we  get  c|>„+1(-)  as  given  in  (7).  The  logarithmic 


determinant  of  CB+1  =  IN  +  On+10Tn+l/X  is 

=ln 


(2)  _^deWn+2+Mn+l) 


x 


n+2 


(23) 


Following  the  method  of  obtaining  (9)-(10),  we  can  show  that  q2’  and  q(2+\  are  related 
by 

^(2)(^I+1) 


^i=ln^2)+ln- 


X 


(24) 


with 


„(2) 


(♦„.)  =  x  +  X'l.+Lu  -  EI1<lwl>L'(M..i  +  M.)-‘  X"  M..U  (25) 


where  ())„_,•=  (|)„(x(.)  and  tyn+u  =  ^(x,.) .  Since  we  wish  for  a  C„+1  with  large 

r(2U+1) 


determinant,  we  want  to  make  In 


or  equivalently  In  r(2)((j))I+1)  large,  as  A,  is  a 


X 


constant. 


Comparing  (25)  to  (10),  we  find  that  r(2)  is  approximately  equal  to  r  when  X  is 
small.  Since  X  can  be  made  as  small  as  desired,  the  approximation  can  be  made 
arbitrarily  accurate.  Therefore  the  basis  function  obtained  in  (11)  is  the  one  that 
minimizes  the  determinant  of  Cn+1  given  Cn ,  which  in  consequence  will  minimize  the 

eigenvalues  of  Cn+1  ,  minimizing  the  squared  error  e2+1  . 


B.  Selection  of  examples  for  labeling 

Assume  the  basis  functions  ())„(•)  have  been  selected  in  the  manner  discussed 
above.  Moreover,  assume  we  have  selected  the  subset  Xs  y  =  {x(.  ,x,.  }  of  J 

signatures  for  which  the  associated  labels  will  be  acquired.  The  Fisher  information  matrix 


11 


If  p<2,(x)  «  1 ,  we  do  not  require  the  label  for  x ,  as  it  does  not  important  for  inclusion  in 
the  training  set.  On  the  other  hand,  if  p<2)(x) » 1,  inclusion  of  x  in  the  training  set  is 
important.  Therefore,  the x  that  maximizes  pt2)(x)  should  be  selected  to  seek  the 
associated  label  y.  Comparing  (29)  to  (15)  we  note  that  p<2)(x)  is  exactly  equivalent  to 
p(x) ,  and  thus  the  x  that  maximizes  p(x)  is  the  one  that  contributes  the  maximally  to 
make  the  squared  test  error  small. 


12 


IV.  APPLICATION  TO  UXO  DETECTION 


The  active-training  methodology  addressed  in  this  paper  may  be  applied  to  any 
detection  problem  for  which  the  data  labels  are  expensive  to  acquire,  and  for  which  there 
is  no  distinct  training  data.  In  particular,  we  consider  the  detection  of  buried  UXO.  For 
UXO  remediation,  the  label  of  a  potential  target  is  acquired  by  excavation,  a  dangerous 
and  time  consuming  task.  The  overwhelming  majority  of  UXO  cleanup  costs  come  from 
excavation  of  non-UXO  items.  In  this  context,  note  that  a  priori  excavations  are  required 
for  the  procedure  in  Sec.  II  (to  obtain  labeled  training  data).  However,  if  the  false-alarm 
rate  is  reduced  at  the  desired  detection  probability,  then  overall  cleanup  costs  may 
diminish  substantially  ( i.e .,  overall,  less  non-UXO  items  need  be  excavated). 

The  results  presented  here  are  for  data  collected  at  an  actual  UXO  site:  Jefferson 
Proving  Ground  in  the  United  States.  The  technique  in  Sec.  II  is  compared  with  results 
obtained  using  existing  procedures.  Specifically,  the  principal  challenge  in  UXO  sensing 
is  development  of  a  training  set,  for  design  of  the  detection  algorithm.  At  an  actual  UXO 
site  there  is  often  a  significant  quantity  of  UXO,  UXO  fragments  and  man-made  clutter 
on  the  surface.  It  has  been  recognized  that  the  characteristics  of  the  surface  UXO  and 
clutter  is  a  good  indicator  of  what  will  be  found  in  the  subsurface.  Consequently,  in 
practice,  a  subset  of  the  surface  UXO  and  clutter  are  buried,  and  magnetometer  and 
induction  data  are  collected  for  these  items,  for  which  the  labels  are  obviously  known. 
The  measured  data  and  associated  labels  (UXO/non-UXO)  are  then  used  for  training 
purposes.  Of  course,  the  process  of  burying,  collecting  data,  and  then  excavating  these 
emplaced  items  is  time  consuming  and  dangerous  (for  the  UXO  items),  with  this 
procedure  eliminated  by  the  techniques  outlined  in  Sec.  II. 


A.  Magnetometer  and  electromagnetic  induction  sensors 

Magnetometer  and  electromagnetic  induction  (EMI)  sensors  are  widely  applied  in 
sensing  buried  conducting/ferrous  targets,  such  as  landmines  and  UXO.  The 
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magnetometer  is  a  passive  sensor  that  measures  the  change  of  the  earth’s  background 
magnetic  field  due  to  the  presence  of  a  ferrous  target.  Magnetometers  measure  static 
magnetic  fields.  An  EMI  sensor  actively  transmits  a  time- varying  electromagnetic  field, 
and  consequently  senses  the  dynamic  induced  secondary  field  from  the  target.  To 
enhance  soil  penetration,  EMI  sensors  typically  operate  at  kilohertz  frequencies.  We  here 
employ  a  frequency-domain  EMI  sensor  that  transmits  and  senses  at  several  discrete 
frequencies  [15].  Magnetometers  only  sense  ferrous  targets,  while  EMI  sensors  detect 
general  conducting  and  ferrous  items. 

Parametric  models  have  been  developed  for  both  magnetometer  and  EMI  sensors 
[16-18].  The  target  features  x  are  extracted  by  fitting  the  EMI  and  magnetometer  models 
to  measured  sensor  data.  The  vector  x  has  parameters  from  both  the  magnetometer  and 
EMI  data,  and  therefore  in  this  sense  the  data  from  these  two  sensors  are  “fused”.  The 
one  place  where  these  two  models  have  overlapping  parameters  is  in  specification  of  the 
target  position.  The  magnetometer  data  often  yields  a  very  good  estimation  of  the  target 
position,  and  therefore  such  are  used  in  x.  In  fact,  the  target  position  specified  by  the 
magnetometer  data  is  explicitly  utilized  as  prior  information  when  fitting  EMI  data  to  the 
EMI  parametric  model.  Details  on  the  magnetometer  and  EMI  models,  and  on  the  model¬ 
fitting  procedure,  may  be  found  in  [18]. 

The  features  employed  are  as  in  [18],  and  the  features  are  centered  and 
normalized.  Specifically,  using  the  training  data,  we  compute  the  mean  feature  vector 

'y 

xmean  and  the  variance  of  each  feature  component  (let  oj  represent  the  variance  of  the  ith 
feature).  Before  classification,  a  given  feature  vector  x  is  shifted  by  implementing 
xshift=x-Xmean  ,  and  then  the  ith  feature  component  of  xshift  is  divided  by  oi  to  effect  the 
normalization. 

B.  Measured  sensor  data  from  the  Jefferson  Proving  Ground 

Jefferson  Proving  Ground  (JPG)  is  a  former  military  range  that  has  been  utilized 
for  UXO  technology  demonstrations  since  1994.  We  consider  data  collected  by  Geophex, 
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Ltd.  in  the  latest  phase  (Phase  V)  of  the  JPG  demonstration.  The  goal  of  the  JPG  V  is  to 
evaluate  the  UXO  detection  and  discrimination  abilities  under  realistic  scenarios,  where 
man-made  and  natural  clutter  coexist  with  UXO  items.  Our  results  are  presented  with  the 
GEM-3  and  magnetometer  data  from  two  adjoining  areas,  constituting  a  total  of 
approximately  five  acres.  There  are  433  potential  targets  detected  from  sensor  anomalies, 
40  of  which  are  proven  to  be  UXO  and  the  others  are  clutter.  The  excavated  UXO  items 
include  4.2  inch,  60  mm,  and  81mm  mortars;  5  inch,  57  mm,  76  mm,  105  mm,  152  mm, 
and  155  mm  projectiles;  and  2.75  inch  rockets. 

This  test  was  performed  with  US  Army  oversight.  One  of  the  two  JPG  areas  was 
assigned  as  the  training  area,  for  which  the  ground  truth  (UXO/non-UXO)  was  given. 
The  trained  detection  algorithms  are  then  tested  on  the  other  area,  and  the  associated 
ground  truth  was  revealed  later  to  evaluate  performance.  It  was  subsequently  recognized 
that  several  UXO  types  were  found  in  equal  number  in  each  of  the  two  areas.  This 
indicates  an  effort  to  match  the  training  data  to  the  detection  data,  in  the  manner 
discussed  above,  involving  burial  of  known  UXO  and  non-UXO  collected  on  the  surface. 

Each  sensor  anomaly  is  processed  by  fitting  the  associated  magnetometer  and 
EMI  data  to  the  parametric  models  [18],  and  the  estimated  parameters  define  x.  In 
addition,  the  model-fitting  procedure  functions  as  a  prescreening  tool.  Any  sensor 
anomaly  failing  to  fit  well  to  the  model  is  regarded  as  having  been  generated  by  a  clutter 
item.  Therefore,  a  total  of  300  potential  targets  remain  after  this  prescreening  stage,  40  of 
which  are  UXO.  In  the  training  area,  there  are  128  buried  items,  16  of  which  are  UXO. 

C.  Detection  results 

Before  presenting  classification  results,  we  examine  the  characteristics  of  the 
basis  functions  selected  in  the  first  phase  of  the  algorithm,  prior  to  adaptively  selecting 
training  data.  In  Fig.  1  we  consider  the  first  three  basis  functions  bi,  bo  and  b3  selected  by 
the  first  stage  of  the  algorithm.  For  each  feature  vector  x  (from  all  UXO  and  non-UXO), 
we  compute  a  three-dimensional  vector  (X(x,b|),  X(x,b2),  Xtx.ba)).  By  examining  this 
three-dimensional  vector  for  all  x,  we  may  observe  the  degree  to  which  UXO  and  non- 
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UXO  features  are  distinguished  via  the  features  and  kernel.  A  radial  basis  function  kernel 
is  employed  here,  corresponding  to  the  kernel  used  to  select  the  basis  functions  (see 
discussion  below  concerning  the  selected  kernel).  By  examining  Fig.  1  we  observe  that 
the  UXO  and  non-UXO  features  are  relatively  separated,  although  there  is  significant 
overlap,  undermining  classification  performance. 

The  detection  results  are  presented  in  the  form  of  the  receiver  operating 
characteristic  (ROC),  quantifying  the  probability  of  detection  (Pd)  as  a  function  of  the 
false  alarm  count.  We  present  ROC  curves  using  the  adaptive-training  approach 
discussed  in  Sec.  II,  with  performance  compared  to  results  realized  by  training  on  the 
distinct  training  region  discussed  above  (the  latter  approach  reflects  current  practice). 
With  regard  to  conventional  training,  the  algorithm  employed  is  of  identical  form  as  (2), 
with  model  weights  determined  iteratively  using  kernel  matching  pursuits  (KMP).  Details 
on  the  KMP  algorithm  may  be  found  in  [8]  (we  have  employed  the  prefitting  algorithm  in 
[8]).  To  make  the  comparison  appropriate,  the  adaptive  training  and  KMP 
implementation  employ  an  identical  radial  basis  function  (RBF)  kernel  [10] 

K(x,  b)=  .  1  exp[-||x-bf  /a2]  (30) 

^2no2 

The  variance  a~  is  adaptively  adjusted  after  each  basis  vector  is  selected  before  the 
model  weights  are  determined,  and  it  is  not  related  to  the  labeled  training  data.  In 
particular,  a  gradient  search  is  applied  to  refine  a2  with  equation  (11),  by  maximizing 
lnr(b). 


As  indicated  above,  the  designated  training  area  has  128  labeled  items,  and 
conventionally  classifiers  are  tested  on  the  remaining  signatures,  in  this  case  constituting 
172  items.  As  a  first  comparison,  the  adaptive  technique  discussed  in  Sec.  II  is  employed 
to  select  7=128  items  from  the  original  300,  with  these  “excavated”  to  learn  the 
associated  labels.  This  therefore  defines  the  set  Xsj  and  associated  labels  LVjy.  The  basis 
set  B„  is  also  determined  adaptively  using  the  original  300  signatures,  and  here  n=  10.  The 
performance  of  the  adaptive  learning  algorithm  is  then  tested  on  the  remaining  172 
x,  £  Xa  /  ,  although  these  are  generally  not  the  same  testing  examples  used  via  traditional 
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training  of  the  KMP  algorithm  (the  training  sets  do  not  overlap  completely).  For 
comparison,  we  also  show  training  and  testing  results  implemented  via  KMP,  in  which 
the  128  training  examples  are  selected  randomly  from  the  original  300  signatures  X. 
Performance  comparisons  are  shown  in  Fig.  2,  wherein  we  present  results  for  active  data 
selection  (algorithm  in  Sec.  II),  KMP  results  using  the  assigned  128  training  examples, 
and  average  results  for  randomly  choosing  the  128  examples  for  KMP  training  (100 
random  selections  were  performed).  In  addition,  for  the  latter  case  we  also  place  error 
bars  on  the  results;  the  length  of  the  error  bar  is  twice  the  standard  derivation  of  the  Pd 
for  the  associated  false-alarm  count.  Therefore,  if  the  result  is  Gaussian  distributed,  95% 
of  the  values  lie  within  the  error  bar. 

Before  proceeding  we  note  that  the  ROC  curves  are  generated  by  varying  the 
threshold  t,  as  applied  to  the  estimated  label  y.  For  the  binary  UXO-classification 
problem  considered  here,  by  design  we  choose  the  label  y=l  for  UXO  and  y=0  for  non- 
UXO.  In  practice  one  must  choose  one  point  on  the  ROC  at  which  to  operate.  A  naive 
choice  of  the  operating  point  would  be  0.5  (. i.e .,  if  the  classifier  maps  a  testing  feature 
vector  x  to  a  label  y>0.5  the  item  is  declared  UXO,  and  otherwise  it  is  declared  non- 
UXO).  However,  we  must  account  for  the  fact  that  in  practice  the  number  of  non-UXO 
items  is  often  much  larger  than  the  number  of  UXO.  We  have  therefore  invoked  the 
following  procedure. 

We  assume  that  the  error  (noise)  between  the  true  label  (y=l  or  y=0)  and  the 
estimated  label  is  i.i.d.  Gaussian  with  variance  of  a  ,  as  in  (5).  Let  No  and  Ni  represent 
respectively  the  number  of  non-UXO  and  UXO  items  in  the  training  set.  Considering  the 
UXO  (y=  1 )  data,  an  unbiased  estimator  of  the  label  y  will  yield  a  mean  of  one  and  a 
minimum  variance  of  a  /N\.  Similarly,  considering  the  non-UXO  data  (  v=0),  an  unbiased 
estimator  of  the  label  y  will  have  zero  mean  and  minimum  variance  a  /No.  Let  Hi  and  Ho 
correspond  to  the  UXO  and  non-UXO  hypotheses.  Based  upon  the  above  discussion,  we 
model  the  probability  density  function  of  y  for  the  Hi  and  Ho  hypotheses  as 
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p(y|Hj)  =N(l,<j2  /  N{)  and  p(y|H0)  =M1  ,o2  /  N0) .  Rather  than  setting  the  threshold  at 
t= 0.5,  we  set  the  threshold  at  that  value  of  y  for  which  p(y|Hj)  =p(y|H0) ,  yielding 


2,  N  n 


t  = 


N,  -  iNf -(N, -NoXN,  +  ozln^) 


N.-Nq 


(31) 


.2l 


Assuming  a  is  a  small,  we  omit  azln^- ,  obtaining 


t  =  - 


■In', 


From  (32),  the  appropriate  threshold  is  /=0.5  only  if  Nq  =  N\. 


(32) 


For  example,  in  Fig.  2,  only  15  of  the  128  actively  selected  training  data  are 
UXO,  and  therefore  N\  =  15,  No  =  113.  If  we  set  the  threshold  to  be  /=0.5,  we  detect  16% 
of  the  UXO  with  two  false  alarms.  By  contrast,  using  the  procedure  discussed  above  (for 
which  t= 0.27),  we  detect  88%  of  the  UXO  with  25  false  alarms.  The  operating  point 
corresponding  to  t=0.21  is  indicated  in  Fig.  2.  We  similarly  plot  this  point  in  all 
subsequent  ROCs  presented  below. 


We  observe  from  the  results  in  Fig.  2  that  the  active  data  selection  procedure 
produces  the  best  ROC  results  (for  Pd>0.7,  which  is  of  most  interest  in  practice),  with  the 
KMP  results  from  the  specified  training  area  almost  as  good.  It  is  observed  that  the 
average  performance  based  on  choosing  the  training  set  randomly  is  substantially  below 
that  of  the  two  former  approaches,  with  significant  variability  reflected  in  the  error  bars. 
These  results  demonstrate  the  power  of  the  active-data- selection  algorithm  introduced  in 
Sec.  II,  and  also  that  the  training  data  defined  for  JPG  V  is  well  matched  to  the  testing 
data. 


In  the  first  example  we  set  7=128  to  be  consistent  with  the  size  of  the  training  area 
specified  in  the  JPG  V  test.  The  algorithm  in  Sec.  II  can  be  implemented  for  smaller 
values  of  J,  reflecting  less  excavation  required  in  the  training  phase  (for  determination  of 
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target  labels).  It  is  of  interest  to  examine  algorithm  performance  as  7  is  decreased  from 
128.  In  this  case  training  is  performed  using  signatures  and  labels  from  the  7  “excavated” 
items,  and  testing  is  performed  on  the  remaining  300-/  examples.  Results  are  presented 
for  the  active  training  procedure  and  for  randomly  choosing  7  training  examples  (100 
random  instantiations),  as  in  Fig.  2.  In  Figs.  3-5  results  are  presented  for  /=90,  60  and  40. 
Using  /=  90  rather  than  7=128  results  in  very  little  degradation  in  ROC  performance 
(comparing  Figs.  2  and  3),  with  a  slight  performance  drop  for  7=60,  and  a  more 
substantial  drop  for  7=40.  It  is  interesting  to  note  that  with  decreasing  7,  the  number  of 
test  items  300-7  increases,  therefore  increasing  the  number  of  false-alarm  opportunities. 
This  further  highlights  the  quality  of  the  results  in  Figs.  3-5,  vis-a-vis  Fig.  2.  In  all  of 
these  and  subsequent  examples,  the  size  of  the  basis  set  B„  is  n=\0. 

In  the  above  examples  7  was  specified  to  be  matched  to  the  size  of  a  specified 
training  set,  or  it  was  varied  for  comparison  to  such.  However,  the  procedure  in  Sec.  II 
may  be  employed  to  adaptively  determine  the  size  of  the  desired  training  set  Xsj,  based 
on  the  information  gain  as  7  is  increased.  Specifically,  we  track  qn{Xs  y) -qn(Xs  y-1)  for 

increasing  7,  and  terminate  the  algorithm  when  the  information  gain  is  minimal.  At  this 
point,  adding  a  new  datum  to  the  training  dataset  does  not  provide  significant  additional 
information  to  the  classifier  design. 

For  the  JPG  V  data,  the  information  gain  d„(Xs  y)  -qn(Xs  y-1)  is  plotted  in  Fig.  6 

as  a  function  of  7,  and  the  change  in  information  gain  is  given  in  Fig.  7  for  visualization 
assistance.  Based  on  Fig.  6-7  the  size  of  the  training  set  is  set  to  7=65.  In  Fig.  8  results  are 
shown  for  7=65,  with  comparison  as  before  to  KMP  results  in  which  the  7=65  training 
examples  are  selected  randomly.  Examining  the  results  in  Fig.  8,  we  observe  that  the 
active  selection  of  training  data  yields  a  detection  probability  of  approximately  0.95  with 
approximately  35  false  alarms;  on  average  one  encounters  about  five  times  this  number 
of  false  alarms  to  achieve  the  same  detection  probability  (when  selecting  the  training  data 
randomly). 
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V. 


CONCLUSIONS 


There  are  many  remote-sensing  problems  for  which  one  collects  data  from  a  given 
site,  and  the  task  is  to  specify  the  identity  of  the  object  responsible  for  each  signature 
( e.g .,  detection  and  classification).  Due  to  the  variability  and  site-dependent  character  of 
many  target  signatures,  it  is  often  difficult  to  have  reliable  training  data  a  priori  for 
algorithm  design.  In  this  paper  we  have  therefore  developed  an  information-theoretic 
framework  in  which  the  training  data  are  selected  adaptively  from  the  observed  site- 
dependent  data,  without  requiring  an  a  priori  training  set.  Specifically,  the  algorithm 
specifies  those  signatures  for  which  knowledge  of  the  associated  labels  (e.g.,  target/non¬ 
target)  would  be  most  relevant  in  the  context  of  detector  design.  An  “experiment”  is  then 
performed  to  leam  the  target  labels,  where  in  the  context  of  landmine  and  UXO  sensing 
this  corresponds  to  excavating  the  respective  buried  items.  This  is  a  reasonable 
procedure,  since  landmines  and  UXO  need  be  excavated  ultimately  anyway,  and 
therefore  the  algorithm  essentially  prioritizes  the  order  in  which  items  are  excavated,  with 
the  goal  of  ultimately  excavating  fewer  non-targets  (false  alarms)  via  proper  algorithm 
training.  The  algorithm  has  been  demonstrated  successfully  on  measured  magnetometer 
and  EMI  data  from  an  actual  former  bombing  range,  addressing  the  sensing  of  UXO. 

There  are  several  items  that  deserve  further  attention.  It  was  demonstrated  that  the 
gain  in  information  content  is  a  good  measure  of  which  items  should  be  excavated  for 
learning  of  associated  labels.  The  results  in  Figs.  6-8  demonstrated  the  effectiveness  of 
this  procedure,  although  the  actual  selection  of  the  number  of  training  examples,  J.  was 
determined  in  a  somewhat  ad  hoc  manner.  Further  work  is  required  to  make  this 
procedure  more  rigorous  and  automated. 

In  addition,  for  the  results  presented  here  the  detection  algorithm  was  trained 
once  using  the  adaptively  determined  training  set.  However,  in  the  subsequent  testing 
phase  a  “dig  list”  is  specified  for  those  items  that  are  deemed  to  be  associated  with 
targets  of  interest  (here  UXO).  Once  each  item  is  excavated,  and  the  associated  label 
revealed,  the  algorithm  should  be  successively  retrained  and  applied  to  the  remaining 
data.  The  order  of  the  dig  list  -  and  therefore  the  order  in  which  we  learn  the  labels  of  the 
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testing  data  -  is  also  of  interest  since  it  may  be  used  to  further  refine  the  algorithm 

sequentially,  as  a  given  site  is  cleaned  ( e.g .,  of  landmines  or  UXO). 
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Figure  Captions 


Figure  1.  For  the  first  three  basis  functions  selected,  bi,  b2,  and  b3,  the  three- 
dimensional  vector  (X(x,bi),  A'(x.b2j,  X(x.b;))  is  plotted.  Considered  are  feature  vectors  x 
for  all  UXO  and  non-UXO  targets  considered  in  this  study. 

Figure  2.  Receiver  operating  characteristic  (ROC)  curves  based  on  128  training 
examples,  for  which  the  target  labels  were  known.  In  one  case  the  training  set  was 
carefully  designed  a  priori ,  and  in  the  other  the  training  examples  were  chosen  adaptively 
using  the  algorithm  of  Sec.  II.  For  comparison,  results  are  also  shown  when  the  128 
training  examples  are  chosen  randomly,  100  times.  In  the  latter  case  average  results  are 
shown,  as  well  as  the  associated  range  of  variability.  The  indicated  point  on  the  ROC 
corresponds  to  (32). 

Figure  3.  As  in  Fig.  2,  but  now  results  are  only  shown  for  adaptive  training-data 
selection  (Sec.  II)  and  for  random  selection.  In  the  latter  case  results  are  presented  as  in 
Fig.  1.  Results  are  shown  for  7=90  training  examples. 

Figure  4.  As  in  Fig.  3,  with  7=60. 

Figure  5.  As  in  Fig.  4,  with  7=  40. 

Figure  6.  Information  gain  of  adding  a  new  datum,  as  a  function  of  the  number  of  the 
training  examples  7,  selected  adaptively. 

Figure  7.  Difference  in  the  information  gain,  as  a  function  of  the  number  of  training 
examples  7. 

Figure  8.  ROC  curves  based  on  7=65  training  examples,  comparing  the  adaptive 
procedure  (Sec.  II)  to  random  training  data  selection.  Number  of  training  examples 
chosen  based  on  Figs.  6-7. 
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Abstract 

To  achieve  good  generalization  in  supervised  learning,  the  training  and  testing  exam¬ 
ples  are  usually  required  to  be  drawn  from  the  same  source  distribution.  In  this  paper  we 
propose  a  method  to  relax  this  requirement  in  the  context  of  logistic  regression.  Assuming 
Vp  and  Va  are  two  sets  of  examples  drawn  from  two  different  distributions  T  and  A  (called 
concepts,  borrowing  a  term  from  psychology),  where  Va  are  fully  labeled  and  V"  partially 
labeled,  our  objective  is  to  complete  the  labels  of  Vp.  We  introduce  an  auxiliary  variable  // 
for  each  example  in  Va  to  reflect  its  mismatch  with  Vv .  Under  an  appropriate  constraint 
the  p's  are  estimated  as  a  byproduct,  along  with  the  classifier.  We  also  present  an  active 
learning  approach  for  selecting  the  labeled  examples  in  Vp.  The  proposed  algorithm,  called 
migratory  logistic  regression  (MigLogit),  is  demonstrated  successfully  on  simulated  data  as 
well  as  on  real  measured  data  of  interest  for  unexploded  ordnance  (UXO)  cleanup. 


1  Introduction 

In  supervised  classification  problems,  the  goal  is  to  design  a  classifier  using  the  training  exam¬ 
ples  (labeled  data)  such  that  the  classifier  predicts  the  labels  correctly  for  unlabeled  test  data. 
The  accuracy  of  the  predictions  is  significantly  affected  by  the  quality  of  the  training  examples, 
which  are  assumed  to  contain  essential  information  about  the  test  instances  for  which  predic¬ 
tions  are  desired.  A  common  assumption  utilized  by  learning  algorithms  is  that  the  training 
examples  and  the  test  instances  are  drawn  from  the  same  source  distribution. 

As  a  practical  example,  consider  the  detection  of  a  concealed  entity  based  on  sensor  data 
collected  in  a  non-invasive  manner.  This  problem  is  of  relevance  in  several  practical  problems. 
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for  example  in  the  medical  imaging  of  potential  tumors  or  other  hidden  anomalies.  In  the  con¬ 
text  of  remote  sensing,  one  is  often  challenged  with  the  problem  of  detecting  and  characterizing 
a  concealed  (e.g.,  underground)  target  based  on  remotely  collected  sensor  data.  In  an  example 
that  will  be  considered  explicitly  in  this  paper,  consider  the  detection  of  buried  unexploded 
ordnance  (UXO)  (Zhang  et  al.  2003).  An  unexploded  ordnance  is  a  bomb  that  did  not  explode 
upon  impact  with  the  ground,  and  such  items  pose  great  danger  if  disturbed  (excavated)  with¬ 
out  care.  Sensors  used  for  detecting  and  characterizing  UXO  include  magnetometers  and  elec¬ 
tromagnetic  induction  (Zhang  et  al.  2003).  In  designing  an  algorithm  for  characterization  of 
anomalies  detected  by  such  sensors,  to  determine  if  a  given  buried  item  is  UXO  or  clutter,  one 
typically  requires  training  data.  Such  training  data  typically  comes  from  other  former  bombing 
sites  that  have  been  cleaned,  and  there  is  a  significant  issue  as  to  whether  such  extant  labeled 
sensor  data  are  relevant  for  a  new  site  under  test.  The  challenge  addressed  in  this  paper  in¬ 
volves  learning  the  relevance  and  relationship  of  existing  labeled  (training)  data  for  analysis  of 
a  new  unlabeled  or  partially  labeled  data  set  of  interest.  This  type  of  problem  has  significant 
practical  relevance  for  UXO  sensing,  for  which  results  are  presented  on  measured  data,  as  well 
as  for  the  aforementioned  classes  of  problems,  for  which  there  is  uncertainty  concerning  the 
appropriateness  of  existing  labeled  data  for  a  new  set  of  unlabeled  data  of  interest. 

To  place  this  problem  in  a  mathematical  setting,  let  T (x,  y)  be  the  probability  distribution 
(or  concept,  borrowing  a  term  from  psychology  1 )  from  which  test  instances  (each  including  a 
feature  vector  x  and  the  associated  class  label  y)  are  drawn.  The  goal  in  classifier  design  is  to 
minimize  a  loss  function  L(y,  C(x))/  which  is  a  quantitative  measure  for  the  loss  incurred  by 
the  classifier  when  it  predicts  ('(x)  for  x  whose  true  label  is  y.  The  minimization  is  performed 
for  N  independent  training  examples  (x,  y)  drawn  from  T (x,  y),  leading  to  the  empirical  loss 
minimization  (Vapnik  1998  1999) 


1  N 

min^5ZL(^(Xi))’ 

V  i= 1 


with  (x,  y)  ~  T (x,  y) 


(1) 


The  empirical  loss  is  known  to  approach  the  true  loss  when  N  — >  oo. 

A  learning  algorithm  based  on  the  empirical  loss  minimization  in  (1)  implicitly  assumes 
that  the  future  test  instances  are  also  drawn  from  T (x,  y).  It  is  this  assumption  that  assures  that 
the  classifier  generalizes  to  test  instances  when  it  is  trained  to  minimize  empirical  loss  on  train¬ 
ing  examples.  This  assumption,  however,  is  often  violated  in  practice,  since  training  examples 
and  test  instances  may  correspond  to  different  collections  of  measurements  (likely  performed 
at  different  times  under  different  experimental  conditions)  and  the  class  memberships  of  the 
measurements  may  also  change.  These  issues  can  introduce  statistical  differences  between  the 
training  examples  and  the  test  instances;  the  UXO-sensing  problem  discussed  above  consti¬ 
tutes  an  important  example  for  which  the  aforementioned  statistical  issues  hold,  concerning 
the  utility  of  existing  labeled  (training)  data. 

Assume  that  one  has  training  examples  from  a  distribution  Zl(x,  y)  which  is  different  from 
T (x,  y).  For  convenience  of  exposition,  we  call  T (x,  y)  the  primary  or  target  distribution  and 

traditionally,  the  (probabilistic)  mapping  Pr(y|x)  is  called  a  concept,  and  Pr(x)  is  called  a  virtual  concept 
(language  describing  the  concept)  (Widmer  and  Kubat  1993).  For  simplicity,  usually  they  are  collectively  called  a 
concept. 
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call  M(x,  y)  the  auxiliary  distribution.  Accordingly,  the  examples  drawn  from  T (x,  y)  are  called 
primary  data  and  the  examples  drawn  from  M(x,  y)  are  called  auxiliary  data. 

In  order  to  write  the  empirical  loss  for  T  in  terms  of  examples  drawn  from  A,  one  may 
employ  the  technique  of  importance  sampling  (Robert  and  Casella  1999).  By  doing  so,  one 
makes  the  following  modifications  to  the  expression  of  empirical  loss  (1) 

IvYl  with  (X>2/)  ~  ^(x?y)  (2) 

where  is  the  importance  weight.  It  is  known  that  the  modification  does  not  change  the 
asymptotic  behavior  provided  that  M(x,  y )  has  the  same  nonzero  support  as  T (x,  y). 

Unfortunately,  both  M(x,  if)  and  T (x,  if)  are  unknown  to  the  algorithm;  all  that  is  available 
are  samples  from  M(x,  if).  The  challenge,  therefore,  is  to  learn  a  classifier  on  training  examples 
drawn  from  M(x,  if)  such  that  the  resulting  classifier  still  generalizes  to  test  instances  drawn 
from  T (x,  y).  Clearly,  without  assuming  any  knowledge  about  the  relationship  between  A(x.  if) 
and  T (x,  y ),  there  is  little  one  can  do  but  to  treat  the  training  examples  as  if  they  are  from  the 
target  distribution.  This  will,  of  course,  introduce  errors,  which  may  be  intolerable  when  the 
difference  between  M(x,  y)  and  T (x,  y)  is  large. 

The  problem  of  learning  on  examples  from  one  distribution  with  the  goal  of  generalizing  to 
instances  from  a  different  distribution  has  been  addressed  in  different  contexts,  using  different 
names.  In  the  following,  we  provide  a  brief  review  of  this  previous  work. 

1.1  Tackling  Concept  Drifts  in  Time- Varying  Data 

Many  data  come  naturally  in  streams,  collected  over  a  period  of  time.  Such  applications  include 
weather  recordings,  sales  and  customer  data,  surveillance  video  streams,  to  name  a  few.  For 
streamed  data,  it  is  natural  to  consider  online  learning,  in  which  the  leaner  is  dynamically 
presented  with  the  true  label  after  it  makes  a  prediction  and  updates  its  hypothesis  based  on 
newly  received  true  labels.  When  the  streamed  data  are  recorded  over  an  extended  period  of 
time,  the  statistics  in  the  data  are  likely  to  change.  The  time-dependent  variation  of  statistics 
in  streamed  data  are  termed  concept  drift  in  the  literature  (Klinkenberg  and  Joachims  2000; 
Tenenbaum  1999;  Wang  et  al.  2003;  Widmer  and  Kubat  1996  1993). 

Concept  drift  falls  under  the  general  formulation  in  (2).  Here  the  target  distribution  char¬ 
acterizes  the  statistics  of  the  most  recent  recordings,  and  there  is  an  auxiliary  distribution  char¬ 
acterizing  the  recordings  in  each  time  interval  in  the  past.  The  goal  in  concept-drift  learning 
is  to  employ  the  available  recordings  to  build  up  the  target  concept  (i.e.,  the  mapping  from  a 
feature  vector  to  the  associated  class  label)  for  the  current  moment.  An  important  notion  is  the 
age  of  each  recording,  which  determines  the  utility  of  the  recording  to  the  current  prediction. 

Three  widely  used  methods  for  handling  concept  drift  are:  time  windows  (Widmer  and  Kubat 
1996),  instance  weighting  (Klinkenberg  and  Ruping  2003),  and  ensemble  learning  (Wang  et  al. 
2003).  In  the  first  method,  a  time  window  is  applied  to  the  data  stream  and  the  data  within  the 
window  are  employed  to  build  the  classifier  for  the  current  concept.  The  window  keeps  mov¬ 
ing  towards  the  future  so  that  the  most  recent  recording  is  always  included  in  the  window.  The 
window  acts  like  a  limited  memory,  with  the  data  outside  the  window  forgotten.  A  key  issue 
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is  to  determine  the  window  size,  which  should  accurately  capture  the  rate  of  concept  drift.  The 
method  of  instance  weighting  is  based  on  the  observation  that  the  importance  of  a  past  exam¬ 
ple  to  the  current  concept  does  not  change  abruptly  but  rather  decreases  gradually  A  weight  is 
assigned  to  each  past  example  to  reflect  its  importance.  Assuming  that  concept  drift  is  mono¬ 
tonic  (i.e.,  newer  examples  are  always  more  important  than  older  ones),  one  can  construct  the 
weights  according  to  the  age  of  each  example;  for  example,  one  can  choose  weights  that  de¬ 
crease  exponentially  with  the  age  of  examples  (Klinkenberg  and  Ruping  2003).  The  method  of 
ensemble  learning  maintains  an  ensemble  of  classifiers,  instead  of  a  single  one,  for  the  current 
concept.  This  is  done  by  dividing  the  data  stream  into  chunks  and  learning  a  classifier  based 
on  each  data  chunk.  The  relevance  of  each  classifier  to  the  current  concept  is  evaluated  by  the 
generalization  error  when  applying  the  classifier  to  the  most  recent  data  chunk.  The  relevance 
is  employed  as  a  weight  applied  to  each  classifier  and  the  weighted  classifiers  are  employed  to 
make  predictions  for  the  current  data  chunk. 

1.2  Sample  Selection  Bias  in  Econometrics 

In  econometrics,  the  observed  data  are  often  a  nonrandomly  selected  sample  of  the  true  dis¬ 
tribution  of  interest.  If  the  distribution  of  interest  is  T,  the  selection  bias  results  in  samples 
drawn  from  A  which  is  different  from  T.  Heckman  (1979)  developed  a  method  to  correct  the 
sample-selection  bias  for  linear  regression  models.  The  basic  idea  of  Heckman's  method  is  that 
if  one  can  estimate  the  probability  of  an  observation  being  selected  into  the  sample,  one  can 
use  this  probability  estimate  to  correct  the  selection  bias. 

Heckman's  model  has  recently  been  extended  to  classification  problems  (Zadrozny  2004), 
where  it  is  assumed  that  the  test  instances  are  drawn  from  T (x,  y)  =  Pr(x,  y)  while  the  training 
examples  are  drawn  from  A(x,  y )  =  Pr(x,  y\s  =  1),  where  the  variable  s  controls  the  selection  of 
training  examples:  if  s  —  1,  (x,  y)  is  selected  into  the  training  set;  if  s  —  0,  (x,  y)  is  not  selected 
into  the  training  set.  Evidently,  unless  s  is  independent  of  (x,  y),  Pr(x,  y\s  =  1)  ^  Pr(x.  y)  and 
hence  T (x,  y)  is  different  from  T (x,  y).  By  Bayes  rule, 

Pr(x,y)  =  Pr(>  =  1) 

Pr(x,  y\s  —  1)  Pr(s  =  l|x,  y) 
or 

T  (A  V)  =  Pr(s  =  i)  (4) 

A  (x,y)  Pr(s  =  l|x,  y) 

plugging  this  into  (2),  one  has 

^y)  L fo’  with  (x,j/)~Pr(x,j/|s  =  l)  (5) 

which  implies  that  if  one  has  access  to  one  can  correct  the  selection  bias  by  using 

the  empirical  loss  as  expressed  in  (5).  In  the  special  case  when  Pr(s  =  l|x, y)  =  Pr(s  =  l|x), 
one  may  estimate  Pr(s  =  l|x)  from  a  sufficient  sample  of  Pr(x,  s)  if  such  a  sample  is  available 
(Zadrozny  2004).  In  the  general  case,  however,  it  is  difficult  to  estimate  Pr^'^  y)  /  as  we  do  not 
have  a  sufficient  sample  of  Pr(x,  y,  s )  (if  we  do,  we  already  have  a  sufficent  sample  of  Pr(x,  y), 
which  contradicts  the  assumption  of  the  problem). 
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1.3  Overview  of  This  Work 

In  this  paper  we  propose  an  efficient  algorithm  for  solving  the  general  problem  of  learning  on 
examples  from  A  with  the  goal  of  generalizing  to  instances  from  T,  when  A  is  different  from 
T.  We  consider  the  case  in  which  we  have  a  fully  labeled  auxiliary  data  set  Va  and  a  partially 
labeled  primary  data  set  Vp  —  Vf  U  Vp,  where  Vf  are  labeled  and  Vp  unlabeled.  We  assume 
that  Vp  are  examples  of  the  primary  concept  T  (the  concept  we  are  interested  in)  and  Va  are 
examples  of  the  auxiliary  concept  A  (the  one  providing  indirect  and  low-quality  information 
about  T).  Our  objective  is  to  use  a  mixed  training  set  Vtr  =  Vf  U  Va  to  train  a  classifier  that 
predicts  the  labels  of  Vf  accurately,  with  the  hope  that  Vf  is  required  to  have  a  small  number 
of  examples. 

Assume  Vp  ~  Pr(x,  y ).  In  light  of  (3),  we  can  write  Va  ~  Pr(x,  y\s  =  1)  as  long  as  the  source 
distributions  of  Vp  and  Va  have  the  same  support  of  nonzero  probability2.  As  explained  previ¬ 
ously,  it  is  difficult  to  correct  the  mismatch  by  directly  estimating  •  Therefore  we  take 

an  alternative  approach.  We  introduce  an  auxiliary  variable  //,  for  each  (x“,  yf)  e  V"  to  reflect 
its  mismatch  with  Vp  and  to  control  its  participation  in  the  learning  process.  The  p's  play  a 
similar  role  as  the  weighting  factors  in  (5).  However,  unlike  the  weighting  factors,  the 

auxiliary  variables  are  estimated  along  with  the  classifier  in  the  learning.  We  employ  logistic 
regression  as  a  specific  classifier  and  develop  our  method  in  this  context. 

The  remainder  of  the  paper  is  organized  as  follows.  A  detailed  description  of  the  proposed 
method  is  provided  in  Section  2,  followed  by  description  of  a  fast  learning  algorithm  in  Section 
3  and  a  theoretical  discussion  in  Section  4.  In  Section  5  we  present  a  method  to  actively  define 
Vf  when  Vf  is  initially  empty.  In  Section  6  we  demonstrate  the  ideas  presented  here  using  sim¬ 
ulated  data,  as  well  as  real  data  of  interest  for  detecting  unexploded  ordnance  (UXO).  Finally, 
Section  7  provides  conclusions. 


2  Migratory  Logistic  Regression  (MigLogit):  Learning  Jointly 
on  the  Primary  and  Auxiliary  Data 

We  assume  Vf  are  fixed  and  nonempty,  and  without  loss  of  generality,  we  assume  Vf  are  al¬ 
ways  indexed  prior  to  Vp,  i.e.,  Vf  =  {(xf,^)}^  and  Vf  =  {(xf,yf)  :  yf  missing}^vp+r  We 
use  Na,  Np,  and  Nf  to  denote  the  size  (number  of  data  points)  in  V",  Vp,  and  Vf,  respectively. 
In  Section  5  we  discuss  how  to  actively  determine  Vf  when  Vf  is  initially  empty.  We  con¬ 
sider  the  binary  classification  problem  and  the  labels  ya ,  yp  €  {—1, 1}-  For  notational  simplicity, 
we  let  x  always  include  a  1  as  its  first  element  to  accommodate  a  bias  (intercept)  term,  thus 
xp .  x"  £  M':/+ 1  where  d  is  the  number  of  features.  For  a  primary  data  point  (xf,  yf)  e  Vf,  we 
follow  standard  logistic  regression  to  write 


Pr(yf|xf;  w)  =  a(yfw2x 


(6) 


2For  any  Pr(x,y|s  =  1)  ^  0  and  Pr(x,y)  ^  0,  there  exists  =  pr^’^  €  (0,oo)  such  that  (3)  is 

Pr(s=l) 


Pr(s=l|x,y)  Pr(x,y| 

satisfied.  For  Pr(x,  y\s  =  1)  =  Pr(x,  y )  =  0,  any  p/J^ijxV)  ^  0  makes  (3)  satisfied. 
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where  w  €  Md+1  is  a  column  vector  of  classifier  parameters  and  a(rj)  =  1+c,Xp(_.f^  is  the  sigmoid 
function.  For  a  auxiliary  data  point  (x“,  yf)  e  V",  we  define 

Pr(j/“|x?;  w,  m)  =  a(yf  wTx“  +  yfpf)  (7) 

where  p,  is  an  auxiliary  variable.  Assuming  the  examples  in  "Df  and  D“  are  drawn  i.i.d.,  we 
have  the  log-likelihood  function 

f(w,g;^UD“) 

=  Efii  In  ^(l/fwTxf )  +  In  a(7/>Tx“+7/>i)  (8) 

where  p  =  [/i i ,  •  •  •  ,  pn°-]t  is  a  column  vector  of  all  auxiliary  variables. 

The  auxiliary  variable  p,  is  introduced  to  reflect  the  mismatch  of  (x“,  yf)  with  Vp  and  to  con¬ 
trol  its  participation  in  the  learning  of  w.  A  larger  yfp,  makes  Pr(y“|x“;  w,  /./*)  less  sensitive  to  w. 
When  y^Hi  =  oo,  Pr(y“|x“;  w.  //,)  =  1  becomes  completely  independent  of  w.  Geometrically,  the 
p,  is  an  extra  intercept  term  that  is  uniquely  associated  with  x“  and  causes  it  to  migrate  towards 

class  yf.  If  (x“,  yf)  is  mismatched  with  the  primary  data  Vp,  w  cannot  make  Ei=i  In  a(yfwTxf) 
and  In  a(yf  wTx“)  large  at  the  same  time.  In  this  case  x“  will  be  given  an  appropriate  ft,  to  allow 
it  to  migrate  towards  class  yf,  so  that  w  is  less  sensitive  to  (x“,  yf  )  and  can  focus  more  on  fitting 
Vf.  Evidently,  if  the  p's  are  allowed  to  change  freely,  their  influence  will  override  that  of  w 
in  fitting  the  auxiliary  data  Va  and  then  Da  will  not  participate  in  learning  w.  To  prevent  this 
from  happening,  we  introduce  constraints  on  p,  and  maximize  the  log-likelihood  subject  to  the 
constraints: 


maxw|i  f(w,  p;  Vf  U  Va)  (9) 

subject  to  EiTi Vi  C  >  0  (10) 

y?(H>  0,  f  =  1,  2,  •  •  •  ,  Na  (11) 

where  the  inequalities  in  (11)  reflect  the  fact  that  in  order  for  x“  to  fit  yf  =  1  (or  yf  =  —1)  we 


need  to  have  p*  >  0  (or  //,  <  0),  if  we  want  p,  to  exert  a  positive  influence  in  the  fitting  pro¬ 
cess.  Under  the  constraints  in  (11),  a  larger  value  of  yfp,  represents  a  larger  mismatch  between 
(xf,yf)  and  V  and  accordingly  makes  (x“,yf)  play  a  less  important  role  in  determining  w. 
The  classifier  resulting  from  solving  the  problem  in  (9)-(ll)  is  referred  to  as  migratory  logistic 
regression  (MigLogit). 

The  C  in  (10)  reflects  the  average  mismatch  between  Va  and  Vp  and  controls  the  average 
participation  of  Va  in  determining  w.  It  can  be  learned  from  data  if  we  have  a  reasonable 
amount  of  Vf.  However,  in  practice  we  usually  have  no  or  very  scarce  Vf  to  begin  with.  In 
this  case,  we  must  rely  on  other  information  to  set  C.  We  will  come  back  to  a  more  detailed 
discussion  on  C  in  Section  4. 


3  Fast  Learning  Algorithm 

The  optimization  problem  in  (9),  (10),  and  (11)  is  concave  and  any  standard  technique  can 
be  utilized  to  find  the  global  maxima.  However,  there  is  a  unique  p,  associated  with  every 


6 


Migratory  Logistic  Regression  for  Learning  Concept  Drift  Between  Two  Data  Sets 


Liao  and  Carin 


(x“,  yf)  G  Va,  and  when  Va  is  large  using  a  standard  method  to  estimate  p' s  can  consume  most 
of  the  computational  time. 

In  this  section,  we  give  a  fast  algorithm  for  training  MigLogit,  by  taking  a  block-coordinate 
ascent  approach  (Bertsekas  1999),  in  which  we  alternately  solve  for  w  and  //,  keeping  one  fixed 
when  solving  for  the  other.  The  algorithm  draws  its  efficiency  from  the  analytic  solution  of  /i, 
which  we  establish  in  the  following  theorem.  Proof  of  the  theorem  is  given  in  the  appendix, 
and  Section  4  contains  a  discussion  that  helps  to  understand  the  theorem  from  an  intuitive 
perspective. 

Theorem  1:  Let  f(z)  be  a  twice  continuously  differentiable  function  and  its  second  derivative 
f"{z)  <  0  for  any  zeK.  Let  b\  <  b-2  <  ■  ■  ■  <  bN/  R>  0,  and 


n  =  rna x{m  :  mbm  —  Eli  bi  <  R,  1  <  m  <  N} 

Then  the  problem 

max{2i}  Eli  f(bi  +  Zi) 
subject  to  Ef=i  A  <  R-  R>  0 
Zi>  0,  i  =  1,2,- ••  ,N 


has  a  unique  global  solution 


Zi  = 


-  E?-i  bj  +  -R  —  bi,  1  <  i  <  n 

n  =  l  J  n  1  ‘  —  — 

0,  n  <i  <  N 


(12) 


(13) 

(14) 

(15) 


(16) 


For  a  fixed  w,  the  problem  in  (9)-(ll)  is  simplified  to  maximizing  Eli  lncr(?/fwTx“  +  yf/i,) 
with  respect  to  /i,  subject  to  ^  Eli  ViTi  <  C,  C  >  0,  and  y“yj  >  0  for  i  =  1,2,  •••  ,iVa. 
Clearly  In  a(z)  is  a  twice  continuously  differentiable  function  of  z  and  its  second  derivative 
In  a(z)  =  —<j(z)a(—z)  <  0  for  —  oo  <  z  <  oo.  Thus  Theorem  1  applies.  We  first  solve  {yf/i,} 
using  Theorem  1,  then  {pi}  are  trivially  solved  using  the  fact  y“  G  {—1,1}.  Assume  yf  wrx^  < 
yf  wtx^2  <  •  •  •  <  y£  wT  x^,jva,  where  k±,  k2,  ■  ■  ■  ,  k]y°-  is  a  permutation  of  1,  2,  •  •  •  ,  Na.  Then  we 
can  write  the  solution  of  {pf}  analytically. 


f  1  n,a  V"  WTY° 

2^i=l  VkiW 


+  - 


o, 


W  X 


ki 


1  <  i  <  n 
n  <  i  <  Na 


(17) 


where 

n  =  max  {m  : 

1  <  m  <  Na}  (18) 

For  a  fixed  //,  we  use  the  standard  gradient-based  method  (Bertsekas  1999)  to  find  w.  The  main 
procedures  of  the  fast  training  algorithm  for  MigLogit  are  summarized  in  Table  1,  where  the 
gradient  Vwf  and  the  Hessian  matrix  v  are  computed  from  (8). 
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Table  1:  Fast  Learning  Algorithm  of  Migratory  Logistic  Regression  (MigLogit) 
Input:  Va  U  Vf  and  C;  Output:  w  and  , 


1. 

2. 

3. 

4. 

5. 

6. 


7. 

8. 
9. 


Initialize  w  and  /i,  =  0  for  i  =  1,  2,  •  •  •  ,  Aa. 

Compute  the  gradient  Vwf  and  Hessian  matrix  V;vL 
Compute  the  ascent  direction  d  =  — (V^£)_1VWL 
Do  a  linear  search  for  the  step-size  a*  =  arg maxa  £(w  +  ad). 
Update  w:  w  w  +  a*d. 

Sort  {yfwrx“]A“  in  ascending  order.  Assume  the  result  is 

yakl  wTx^  <  yak2  wtx“2  <  •  •  •  <  ylNa™T*-kNa' where  h,k2,--‘  >kNa 

is  a  permutation  of  1,  2,  •  •  •  ,iV“- 
Find  the  n  using  (18). 

Update  the  auxiliary  variables  using  (17). 

Check  the  convergence  of  t  exit  and  output  w  and  {/i ,;})!))  if  con¬ 
verged;  go  back  to  2  otherwise. 


4  Auxiliary  Variables  and  Choice  of  C 

Theorem  1  and  its  constructive  proof  in  the  appendix  offers  some  insight  into  the  mecha¬ 
nism  of  how  the  mismatch  between  Va  and  V1’  is  compensated  through  the  auxiliary  vari¬ 
ables  {/q}.  To  make  the  description  easier,  we  think  of  each  data  point  x“  e  Va  as  getting 
principal  importance  y“wTx“  from  w  and  additional  importance  from  a  given  budget  to¬ 
taling  NaC  (C  represents  the  average  budget  for  a  single  xa).  From  the  appendix,  N"C  is  dis¬ 
tributed  among  the  auxiliary  data  { x“ }  by  a  "smallest-first"  rule:  the  smallest  x)'  (that  which 
has  the  smallest  y%  wTx£  ),  gets  a  portion  y£  jikl  from  NaC  first,  and  when  the  total  impor¬ 
tance  yj*  wTx“  +  y£  /ikl  reaches  the  value  of  the  second  smallest  x^,  NaC  becomes  equally 
distributed  to  x£  and  x)'o  such  that  their  total  importances  are  always  equal.  Then,  when 
ykl  wrx^  +  yklHk !  =  l/fc2wi  xfc2  +  Ul2yk2  reach  the  importance  of  the  third  smallest,  NaC  becomes 
equally  distributed  to  three  of  them  to  make  them  equal.  The  distribution  continues  in  this  way 
until  the  budget  NaC  is  used  up.  The  "smallest-first"  rule  is  essentially  a  result  of  the  concavity 
of  the  logarithmic  sigmoid  function  lncr(-).  The  goal  is  to  maximize  lnu(y“wTx“  +  yfi-i,)- 
The  concavity  of  In  cr(-)  dictates  that  for  any  given  portion  of  N"C,  distributing  it  to  the  smallest 
makes  the  maximum  gain  in  In  a. 

The  C  is  used  as  a  means  to  compensate  for  the  loss  that  Da  may  suffer  from  w.  The  classifier 
w  is  responsible  for  correctly  classifying  both  Va  and  Vv .  Because  V"  and  Vp  are  mismatched, 
w  cannot  satisfy  both  of  them:  one  must  suffer  if  the  other  is  to  gain.  As  Vp  is  the  primary  data 
set,  we  want  w  to  classify  Vp  as  accurately  as  possible.  The  auxiliary  variables  are  therefore 
introduced  to  represent  compensations  that  Va  get  from  C.  When  xa  gets  small  contribution 
from  w  and  is  small,  it  is  because  xa  is  mismatched  and  in  conflict  with  Vp  (assuming  perfect 
separation  of  Va,  no  conflict  exists  among  themselves).  By  the  "smallest  first"  rule,  the  most 
mismatched  x“  gets  compensation  first. 

A  high  compensation  whittles  down  the  participation  of  x“  in  learning  w.  This  is 
readily  seen  from  the  contribution  of  (x“,  y“)  to  Vwf  and  which  are  obtained  from  (8)  as 

cr(— yfwTxf  —  y“yj)y“x“  and  —  <x(— y“wTx“  —  yf/x,)cr(yf w2x“  +  yf/i,)x“x“  T,  respectively.  When 
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yflM  is  large,  cr(—yf wTx“  —  y“/i,)  is  close  to  zero  and  hence  the  contribution  of  (x“,  y“)  to  Vwf 
and  are  ignorable.  We  in  fact  do  not  need  an  infinitely  large  y''/J,  to  make  the  contributions 
of  x“  ignorable,  because  cr(/i)  is  almost  saturated  at  y  =  ±6.  If  y“wTx“  =  —6,  of— y“wTx“)  = 
0.9975,  implying  a  large  contribution  of  (x“,y“)  to  Vwf,  which  happens  when  w  assigns  x“  to 
the  correct  class  y“  with  probability  of  cr(y?wTxf)  =  of— 6)  =  0.0025  only.  In  this  nearly  worst 
case,  a  compensation  of  yf/i,  =  12  can  effectively  remove  the  contribution  of  (x“,yf)  because 
cr ( — y“ w T xf — y “ /i,: )  =  cr(6  — 12)  =  cr(— 6)  =  0.0025.  To  effectively  remove  the  contributions  of  Am 
auxiliary  data,  one  needs  a  total  budge  12A'n,  resulting  in  an  average  budget  C  =  12Nm/Na. 

To  make  a  right  choice  of  C ,  the  Nm/Na  should  represent  the  rate  that  V"  are  mismatched 
with  Vp.  This  is  because  we  want  NaC  to  be  distributed  only  to  that  part  of  Va  that  is  mis¬ 
matched  with  Vp,  thus  permitting  us  to  use  the  remaining  part  in  learning  w.  The  quantity 
Nm/Na  is  usually  unknown  in  practice.  However,  C  =  12Am/A“  gives  one  a  sense  of  at  least 
what  range  C  should  be  in.  As  0  <  Nm  <  Na,  letting  0  <  C  <  12  is  usually  a  reasonable  choice. 
In  our  experiences,  the  performance  of  MigLogit  is  relatively  robust  to  C,  as  demonstrated  in 
Section  6.2. 


5  Active  Selection  of  Vj 


In  Section  2  we  assumed  that  Vf  had  already  been  determined.  In  this  section  we  describe  how 
Vf  can  be  actively  selected  from  Vp,  based  on  the  Fisher  information  matrix  (Fedorov  1972; 
MacKay  1992).  The  approach  is  known  as  active  learning  (Cohn  et  al.  1995;  Krogh  and  Vedelsby 
1995). 

Let  Q  denote  the  Fisher  information  matrix  of  Vf  U  V"  about  w.  By  definition  of  the  Fisher 
information  matrix  (Cover  and  Thomas  1991),  Q  =  J—f—2 ,  and  substituting  (8)  into 

this  equation  gives  (a  brief  derivation  is  given  in  the  appendix) 


Q  =  ESioffi  -  +  ET=>?  (i  -  <)« 


!iV0 


a  T 


(19) 


where  af  =  cr( wrxf)  for  i  =  1,2,...,  Nf ,  and  cr“  =  u(wTx“  +  y,)  for  i  —  1,2, ,  Na,  and  w  and 
{/i,}  represent  the  true  classifier  and  auxiliary  variables. 

It  is  well  known  the  inverse  Fisher  information  Q-1  lower  bounds  the  covariance  matrix 
of  the  estimated  w  (Cover  and  Thomas  1991).  In  particular,  [det(Q)]-1  lower  bounds  the  prod¬ 
uct  of  variances  of  the  elements  in  w.  The  goal  in  selecting  Vf  is  to  reduce  the  variances,  or 
uncertainty,  of  w.  Thus  we  seek  the  Vf  that  maximize  det(Q). 

The  selection  proceeds  in  a  sequential  manner.  Initially  Vp  =  Vp,  Vf  is  empty,  and  Q  = 

G  Vp  is  selected  and  moved  from 
crf  )x''(x'))  /  .  At  each  iteration,  the 


Sili  at ( 1  —  crf)x“x“  T .  Then  one  at  a  time,  a  data  point  x) 
Vp  to  Vf.  This  causes  Q  to  be  updated  as:  Q  <—  Q  +  of  (1  - 
selection  is  based  on 


maxxpep;  det  {Q  +  of  (1  -  uf)xf(xf)r} 

=  maxxP£VP  { 1  +  of  (1  -  of )  (xf )TQ-1xf  }  (20) 

where  we  assume  the  existence  of  Q_1,  which  can  often  be  assured  by  using  sufficient  auxiliary 
data  Va. 
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Evaluation  of  (20)  requires  the  true  values  of  w  and  {//.*},  which  are  not  known  a  priori.  We 
follow  (Fedorov  1972)  and  replace  them  with  the  w  and  {pi}  that  are  estimated  from  Va  U  Vf, 
where  Vf  are  the  primary  labeled  data  selected  up  to  the  present. 


6  Results 

In  this  section  the  performance  of  MigLogit  is  demonstrated  and  compared  to  the  standard  lo¬ 
gistic  regression.  The  MigLogit  is  trained  using  Va  U  Vf,  where  Vf  are  either  randomly  selected 
from  Vp,  or  actively  selected  from  Vp  using  the  method  in  Section  5.  When  Vf  are  randomly 
selected,  50  independent  trials  are  performed  and  the  results  are  obtained  as  an  average  over 
the  trials.  Three  logistic  regression  classifiers  are  trained  using  different  combinations  of  Va 
and  Vf :  V"  U  Vf,  Vf  alone,  and  V"  alone,  where  Vf  are  identical  to  the  Vf  used  by  MigLogit. 
The  four  classifiers  are  tested  on  Vp  =  Vp  \  Vf  to  produce  the  test-error  rate  or  the  area  under 
the  ROC  curve.  Calculation  of  test  error  rates  is  based  on  the  following  decision  rule:  declare 
yp  =  —1  if  a( w1  xp)  <  0.5  and  yp  =  1  otherwise,  for  any  xp  e  Vp. 

The  performance  of  MigLogit  is  demonstrated  on  two  problem  domains.  The  first  is  a 
simulated  example  and  the  second  is  detection  of  unexploded  ordnance  (UXO)  where  the  UXO 
signatures  are  site-sensitive. 

Throughout  this  section  the  C  in  MigLogit  is  set  to  C  =  6  when  the  comparison  is  made 
to  logistic  regression.  In  addition,  we  present  a  comparison  of  MigLogit  with  different  G"s,  to 
examine  the  sensitivity  of  MigLogit's  performance  to  C. 


6.1  Synthesized  Data 


In  the  first  example,  the  primary  data  are  simulated  as  two  bivariate  Gaussian  distributions 
representing  class  " —l"  and  class  "+1",  respectively.  In  particular,  we  have  Pr(xp|yp  =  —1)  = 
M  (xp;  p0,  S)  and  Pr(xp| yp  =  1)  =  A f  (xp;  p±,  S),  where  the  Gaussian  parameters  p0  =  [0,  0]T, 


Pi  =  [2.3,2.3]t,  and  £ 


1.75  -0.433 

-0.433  1.25 


The  auxiliary  data  Va  are  then  a  selected 


draw  from  the  two  Gaussian  distributions,  as  described  in  (Zadrozny  2004).  We  take  the 
selection  probability  Pr(s|xp,  yp  =  —1)  =  cr(w0  +  W\K (xp,  ps0]  E))  and  Pr(s|xp,  yp  =  +1)  = 
o-(tc0  +  wiK(xp,  pl;T,)),  where  a  is  the  sigmoid  function,  w0  =  —  1,  w\  =  exp(l),  A'(xp,  = 

exp{— 0.5(xp— ^q)tE~1(xp— /xg)}  with  ps0  =  [2, 1]T,  and  AT(xp,  psp,  S)  =  exp{-0.5(xp-^s1)TS-1(xp- 
p{)}  with  pi  =  [0,  3]t.  We  obtain  150  samples  of  Vp  and  150  samples  of  Va,  which  are  shown 
in  Figure  3. 

The  MigLogit  and  logistic  regression  classifiers  are  trained  and  tested  as  explained  at  the 
beginning  of  this  section.  The  results  are  represented  as  test  error  rate  as  a  function  of  num¬ 
ber  of  primary  labeled  data  used  in  training,  and  are  shown  in  Figures  1  and  2.  Each  curve 
in  Figure  1  is  an  average  over  50  independent  trials,  with  each  trial  having  an  independent 
random  selection  of  Vf.  Figure  2  presents  the  active  learning  results,  with  Vf  actively  selected 
as  described  in  Section  5. 

Several  observations  are  made  from  inspection  of  Figures  1  and  2. 


•  The  MigLogit  consistently  outperforms  the  three  standard  logistic  regression  classifiers. 
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Performance  comparison 


Figure  1:  Test  error  rates  of  MigLogit  and  logistic  regression  on  the  toy  data,  as  a  function  of 
size  of  X?f.  The  primary  labeled  data  Vf  are  randomly  selected  from  Vp.  The  error  rates  are  an 
average  over  50  independent  trials  of  random  selection  of  Vf. 


Performance  comparison 

MigLogit(C=6)  trained  on  Da+active 
-  -  -  Logit  trained  on  Da+active 


0.3  - . 5-  . : .  -  .  Logit  trained  on  active  D£ 


01 - 1 - 1 - 1 - 1 - 1 - 1 - 

0  5  10  15  20  25  30  35 

Size  of  (Number  of  primary  labeled  examples) 


Figure  2:  Error  rates  of  MigLogit  and  logistic  regression  on  the  toy  data,  as  a  function  of  size  of 
T>f.  The  primary  labeled  data  T)f  are  actively  selected  from  Vp,  using  the  method  in  Section  5. 


by  a  considerable  margin.  This  improvement  is  attributed  to  a  selective  usage  of  the  ex¬ 
amples  in  Va.  In  particular,  each  example  in  Va  is  employed  according  to  its  agreement 
with  Vf:  a  good  agreement  warrants  a  higher  contribution  to  determination  of  the  clas¬ 
sifier  while  a  poor  agreement  makes  the  contribution  discounted.  The  selectivity  is  im¬ 
plemented  through  the  auxiliary  variables  which  are  estimated  based  on  a  few  examples 
from  Vp. 

•  The  performance  of  the  logistic  regression  trained  on  Vf  alone  changes  significantly  with 
the  size  of  Vf.  This  is  understandable,  considering  that  T>p  are  the  only  examples  deter¬ 
mining  the  classifier.  The  abrupt  drop  of  errors  from  iteration  10  to  iteration  11  in  Figure 
2  may  be  because  the  label  found  at  iteration  11  is  critical  to  determining  w. 
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Active  learning  iteration  0 


Active  learning  iteration  1 

. 


Active  learning  iteration  5 


Active  learning  iteration  10 


Active  learning  iteration  13 


Figure  3:  Illustration  of  active  data  selection  by  MigLogit.  Only  iterations  0,1,5,10,13  are  shown. 
The  different  symbols  are  defined  as:  blue  o  =  Vp  labeled  "—1",  red  o  =  VP  labeled  "+1",  green 
•  =  Va  labeled  "—1",  and  magenta  •  =  Va  labeled  "+1".  The  numbers  in  black  denote  Vf 
and  represent  the  order  of  selection.  The  smaller  •  near  the  decision  boundaries  symbolize 
weakened  participation  of  the  associated  Va  in  determining  w.  This  may  only  be  visible  in  the 
zoomed  figure  (iteration  13). 


•  The  logistic  regression  trained  on  Va  alone  performs  significantly  worse  than  MigLogit, 
reflecting  a  marked  mismatch  between  Va  and  Vp. 

•  The  logistic  regression  trained  on  T>a  U  Vp  improves,  but  mildly,  as  Vf  grows,  and  it  is 
ultimately  outperformed  by  the  the  logistic  regression  trained  on  T>p  alone,  demonstrat¬ 
ing  that  some  data  in  Va  are  mismatched  with  T>p  and  hence  cannot  be  correctly  classified 
along  with  Vp,  if  the  mismatch  is  not  compensated. 

•  As  "Df  grows,  the  logistic  regression  trained  on  Vf  alone  finally  approaches  to  MigLogit, 
showing  that  without  the  interference  of  Va,  a  sufficient  T>f  can  define  a  correct  classifier. 

•  All  four  classifiers  benefit  from  the  actively  selected  V]',  and  this  is  consistent  with  the 
general  observation  with  active  learning  (Cohn  et  al.  1995;  Krogh  and  Vedelsby  1995). 

The  labeled  primary  examples  Df  play  double  roles  in  the  learning  process.  On  the  one 
hand  they  help  to  find  the  correct  w,  and  on  the  other  hand  they  serve  as  representative  primary 
labeled  data  in  finding  the  degree  of  agreement  of  each  auxiliary  example  with  primary  data 
(i.e.,  estimating  the  auxiliary  variables).  Suppose  that  it  requires  n i  primary  labeled  examples 
to  find  the  auxiliary  auxiliary  variables  for  compensating  the  mismatch  between  T>a  and  Vp, 
and  that  it  requires  n2  labeled  primary  examples  alone  (without  using  auxiliary  examples)  to 
find  the  correct  classifier.  One  may  conjecture  that  ri  i  is  smaller  than  n2-  Although  we  have 
not  proven  this  rigorously,  the  results  in  Figures  1  and  2  provide  empirical  evidence  for  this 
being  true:  note  that  MigLogit  uses  much  fewer  primary  labeled  examples  to  find  the  correct 
classifier  than  Logit  trained  on  V]‘  does. 
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The  double  roles  assumed  by  the  primary  labeled  examples  make  it  a  critical  issue  how  to 
select  these  examples.  We  already  see  that  actively  selected  examples  give  significant  boost  to 
the  performance.  This  makes  it  clear  that  active  learning  is  a  more  appropriate  strategy  than 
pure  random  selection,  and  contributes  in  an  important  manner  to  the  proposed  method. 

To  better  understand  the  active  selection  process,  we  show  in  Figure  3  the  first  few  iterations 
of  active  learning.  Iteration  0  corresponds  to  the  initially  empty  Vp,  and  iterations  1,  5,  10, 
13  respectively  correspond  to  1,  5,  10,  13  data  points  selected  accumulatively  from  Vr‘  into 
Vp.  Each  time  a  new  data  point  is  selected,  the  w  is  re-trained,  yielding  the  different  decision 
boundaries.  As  can  be  seen  in  Figure  3,  the  decision  boundary  does  not  change  much  after  10 
data  are  selected,  demonstrating  convergence. 

In  Figure  3,  each  auxiliary  data  point  x“  e  D"  is  symbolically  displayed  with  a  size  in 
proportion  to  exp(— yfni/ 12),  hence  a  small  symbol  of  auxiliary  data  corresponds  to  large  yf/i, 
and  hence  indicates  a  discounted  contribution  of  the  i-th  auxiliary  example  to  determination 
of  w.  The  auxiliary  data  that  cannot  be  correctly  classified  along  with  the  primary  data  are 
de-emphasized  by  the  MigLogit.  Usually  the  auxiliary  data  near  the  decision  boundary  are 
de-emphasized . 

6.2  Application  to  Detection  of  Site-Sensitive  Unexploded  Ordnance  (UXO) 

Unexploded  ordnance  (UXO)  consists  of  ordnance  that  did  not  explode  upon  impact  with  the 
ground.  The  UXO  items  are  typically  buried  and  consist  of  significant  quantities  of  metal. 
Sensing  of  UXO  is  typically  performed  using  electromagnetic  induction  (EMI)  and  magne¬ 
tometer  sensors.  The  principal  challenge  involves  distinguishing  actual  UXO  from  buried  non¬ 
ordnance  conducting  materials.  For  a  more  detailed  general  description  of  UXO  sensing,  see 
(Zhang  et  al.  2003). 

The  sensor  signature  of  a  given  UXO  item  is  dependent  on  the  soil  properties  as  well  as  the 
history  of  the  site  in  which  it  is  located,  the  latter  having  a  particular  strong  influence  on  the 
signature.  The  site  history  is  dictated  by  complex  factors  such  as  co-located  ordnance,  the  way 
the  ordnance  impacted  the  soil,  and  the  surrounding  man-made  conducting  clutter  and  UXO 
fragments.  Therefore  UXO  detection  is  a  typical  site-sensitive  problem. 

The  site-sensitivity  makes  standard  supervised  classification  techniques  an  inappropriate 
choice  for  UXO  detection,  due  to  the  difficulty  in  constituting  a  universal  training  set  for  classi¬ 
fier  design.  The  training  examples  collected  at  previous  sites  are  often  not  appropriate  for  use 
for  analysis  of  the  current  site  since  the  current  site  is  often  different  from  the  previous  ones  (in 
the  sense  described  above).  Despite  these  disparities,  the  examples  from  previous  sites  are  not 
totally  useless;  indeed,  they  can  provide  quite  useful  information  about  the  examples  for  the 
current  site  (particularly  for  the  UXO,  since  the  ordnance  types  at  different  sites  are  often  the 
same  or  similar;  the  clutter  signatures  are  most  often  site  specific).  The  usefulness  of  existing 
labeled  data  for  a  new  site  of  interest  is  dictated  by  the  characteristics  of  the  new  site,  as  well 
as  on  the  characteristics  of  the  sites  from  which  the  labeled  data  were  acquired;  these  inter¬ 
relationships  are  complex  and  often  difficult  to  characterize  a  priori  (often  accurate  records  are 
not  available  about  the  history  of  a  former  bombing  site). 

Let  the  examples  at  the  current  UXO  cite  be  distributed  according  to  T (x,  y),  and  the  exam¬ 
ples  at  a  previous  UXO  cite  be  distributed  according  to  Xl(x,  y).  It  is  seen  that  the  empirical  loss 
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for  detection  of  UXO  at  the  current  cite  is  well  described  by  (2).  Therefore  one  can  employ  the 
technique  of  MigLogit  to  design  the  desired  classifier. 

To  demonstrate  the  utility  of  MigLogit  in  UXO  detection,  we  here  consider  two  UXO  sites 
and  design  the  classifier  for  the  primary  site  (the  one  we  are  interested  in)  by  using  exam¬ 
ples  from  another  site  (the  auxiliary  site).  The  auxiliary  site  is  called  Jefferson  Proving  Ground 
(JPG),  for  which  one  is  provided  with  the  EMI  and  magnetometer  measurements  as  well  the 
associated  labels  (which  are  binary:  UXO  or  non-UXO).  The  examples  from  the  auxiliary  site 
constitutes  the  auxiliary  data  Va.  The  primary  site  we  are  interested  in  is  called  Badlands,  for 
which  we  have  unlabeled  EMI  and  magnetometer  measurement  for  constituting  the  primary 
data  T>p.  The  labeled  JPG  data  consists  of  104  total  items,  of  which  16  are  UXO  and  88  are  non- 
UXO.  The  Badlands  site  consists  of  a  total  of  492  items,  57  of  which  are  UXO  and  the  remaining 
435  are  non-UXO.  These  two  former  bombing  ranges  exist  at  two  very  different  geographical 
locations  within  the  United  States. 


Performance  comparison 


Performance  improvement  of  MigLogit  (C=6)  over  Logit 


MigLogit(C=6)  trained  on  Da+active  D[| 

■  Logit  trained  on  Da+active 

■  Logit  trained  on  active  d£ 

Logit  trained  on  Da 


100 


Size  of  (Number  of  primary  labeled  examples) 


(a)  AUC 


(b)  AUC  of  MigLogit  minus  AUC  of  logistic  regres¬ 
sion 


Figure  4:  (a)  The  area  under  ROC  curve  (AUC)  of  MigLogit  and  logistic  regression  on  the 
UXO  data,  as  a  function  of  size  of  Vj  (b)  The  AUC  of  MigLogit  minus  the  AUC's  of  logistic 
regression.  The  auxiliary  data  are  collected  at  Jefferson  Proving  Ground  (JPG)  and  the  primary 
data  are  collected  at  Badlands.  The  primary  labeled  data  Vj  are  randomly  selected  from  Vp. 
Each  curve  is  an  average  over  50  independent  trials  of  random  selection  of  Vf. 

The  UXO  sensor  measurements  are  mapped  to  four  dimensional  feature  vectors  [log (Mp), 
log (Mz),  z,  log(^)],  where  Mp  and  Mz  are  the  dipole  moments  perpendicular  and  parallel  to 
the  target  axis,  respectively,  and  z  is  the  approximate  target  depth  (Zhang  et  al.  2003).  These 
parameters  are  estimated  by  fitting  the  EMI  and  magnetometer  measurements  to  a  physical 
model  (Zhang  et  al.  2003);  the  features  from  this  study  are  available  upon  request  to  the  au¬ 
thors.  Each  feature  is  normalized  to  have  zero  mean  and  unitary  variance.  In  UXO  detection, 
one  is  interested  in  the  receiver's  operating  characteristic  (ROC)  curve,  particularly  the  area 
under  ROC  curve  (AUC)  (Elanley  and  McNeil  1982). 

The  results  are  presented  in  Figures  4(a)  and  5(a),  where  each  curve  is  the  area  under  ROC 
curve  as  a  function  of  the  size  of  Vf  The  results  in  Figure  4(a)  are  obtained  by  randomly  la- 
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Performance  comparison 


Performance  improvement  of  MigLogit(C=6) 


Size  of  (Number  of  primary  labeled  examples) 


(a)  AUC 


(b)  AUC  of  MigLogit  minus  AUC  of  logistic  regres¬ 
sion 


Figure  5:  (a)  The  area  under  ROC  curve  (AUC)  of  MigLogit  and  logistic  regression  on  the 
UXO  data,  as  a  function  of  size  of  Vf  (b)  The  AUC  of  MigLogit  minus  the  AUC's  of  logistic 
regression.  The  auxiliary  data  are  collected  at  Jefferson  Proving  Ground  (JPG)  and  the  primary 
data  are  collected  at  Badlands.  The  primary  labeled  data  Vf  are  actively  selected  from  Vp,  based 
on  the  method  in  Section  5. 

beling  primary  data  and  by  averaging  the  AUC's  over  50  independent  trials.  The  results  in 
Figure  5(a)  are  obtained  by  actively  labeling  primary  data  using  the  method  in  Section  5.  For 
a  better  view  of  the  improvement  achieved  by  MigLogit,  we  plot  in  Figures  4(b)  and  5(b)  the 
AUC  of  MigLogit  with  the  AUC  of  each  logistic  regression  classifier  subtracted.  A  positive  dif¬ 
ference  indicate  performance  improvement  while  a  negative  difference  indicates  performance 
degradation.  We  have  the  following  observations: 

•  With  Vf  determined  randomly,  MigLogit  outperforms  all  logistic  regression  classifiers  ex¬ 
cept  at  the  early  part  of  the  curves,  where  there  are  very  few  examples  in  Vf.  As  discussed 
at  the  end  of  Section  6.1,  the  primary  labeled  examples  are  critical  to  the  performance  of 
MigLogit.  With  a  few  randomly  selected  examples  one  may  not  be  able  to  find  the  appro¬ 
priate  auxiliary  variables,  leading  to  a  poor  compensation  of  the  mismatch  between  Vp 
and  V°  and  therefore  performance  degradation. 

•  With  Vf  actively  determined,  MigLogit  outperforms  all  logistic  regression  classifiers,  re¬ 
gardless  of  the  number  of  primary  labeled  examples.  This  verifies  that  a  good  choice  of 
Vf  is  important  to  the  performance  of  MigLogit. 

•  Active  learning  is  not  only  beneficial  to  MigLogit,  but  to  other  classifiers  as  well,  again 
demonstrating  the  advantage  of  active  learning. 

•  All  conclusions  observed  in  the  simulated  results  extend  to  the  UXO  results  here. 

These  observations  suggest  that  MigLogit  successfully  leverages  the  auxiliary  data  from 
previous  UXO  sites  to  quickly  find  the  correct  classifier  for  the  new  site,  requiring  much  fewer 
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Performance  improvement  of  MigLogit  (C=2) 


Performance  improvement  of  MigLogit  (C=6) 


Performance  improvement  of  MigLogit  (C=10) 


Performance  improvement  of  MigLogit  (C=14) 


Performance  improvement  of  MigLogit  (C=4) 


Performance  improvement  of  MigLogit  (C=8) 


Performance  improvement  of  MigLogit  (C=12) 


Performance  improvement  of  MigLogit  (C=14) 


Figure  6:  Performance  of  MigLogit  with  different  choices  of  C,  in  the  UXO  detection  prob¬ 
lem.  The  vertical  axis  is  the  AUC  of  MigLogit  minus  the  AUC's  of  logistic  regression. 
The  primary  labeled  data  V]1  are  actively  selected  from  Vp.  From  top-left  to  right-bottom, 

(7  =  2,4, 6,  8, 10, 12, 14, 16. 

labeled  data  from  the  new  site  than  standard  classifiers.  The  results  for  the  actual  (measured) 
UXO  data  suggest  that  the  algorithm  captures  the  concept  drift  associated  with  realistic  prob- 
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lems  of  practical  importance. 

6.3  Robustness  of  MigLogit 

We  have  discussed  in  (4)  how  to  choose  C  in  MigLogit.  In  this  subsection,  we  show  that  when 
the  choice  is  not  accurate,  MigLogit  still  yields  robust  results. 

We  consider  the  same  UXO  data  and  use  the  same  experimental  settings  as  in  the  previous 
subsection,  except  that  we  vary  C  in  MigLogit  to  examine  its  robustness.  The  Vf  are  deter¬ 
mined  by  active  learning  as  described  in  Section  5.  We  consider  eight  different  values  of  C, 
C  —  2,4,  6,  8, 10, 12, 14, 16,  to  examine  the  differences  in  the  results  obtained  under  these  set¬ 
tings.  The  results  are  shown  in  Figure  6.  It  is  seen  that  over  this  wide  range  of  choices  for  C , 
MigLogit  consistently  yields  superior  performances  except  in  a  few  cases,  which  occur  when 
the  size  of  Vf  is  very  small  and  C  is  either  too  large  or  too  small.  These  results  demonstrate  the 
robustness  of  MigLogit  to  the  choice  of  C,  particularly  when  active  learning  is  invoked.  With 
different  C,  the  Vf  are  also  selected  differently,  which  counteracts  the  effect  of  C  and  increase 
the  robustness  of  MigLogit. 


7  Conclusions 

We  have  proposed  an  algorithm,  called  migratory  logistic  regression  (MigLogit),  for  learning  in 
the  presence  of  concept  change  between  the  (auxiliary)  training  data  Va  and  the  (primary)  test¬ 
ing  data  Vp.  The  basic  idea  of  our  method  is  to  introduce  an  auxiliary  variable  /i,  for  each 
example  (x“,  yf)  e  Va,  which  allows  x“  to  migrate  to  the  class  yf  when  it  cannot  be  correctly 
classified  along  with  xp  by  the  classifier.  The  migrations  of  V"  are  controlled  by  the  inequal¬ 
ity  constraint  J2,= i  y“/-L  <  C,  where  C  >  0  is  an  appropriate  bound  limiting  the  average 
migration.  The  primary  labeled  data  Vf  play  a  pivotal  role  in  correctly  learning  the  classifier, 
and  we  have  presented  a  method  to  actively  selecting  Vf,  which  enhances  the  adaptivity  of  the 
entire  learning  process.  We  have  developed  a  fast  learning  algorithm  to  enhance  the  ability  of 
MigLogit  to  handle  large  auxiliary  data  sets. 

The  results  from  both  synthesized  data  and  data  collected  at  actual  unexploded  ordnance 
(UXO)  sites  show  that  MigLogit  yields  significant  improvements  over  the  standard  logistic 
regression,  demonstrating  that  if  the  classifier  trained  on  Va  is  to  generalize  well  to  Vp,  the 
mismatch  between  V"  and  Vp  must  be  compensated. 

In  the  work  presented  here  it  was  assumed  that  we  had  an  existing  set  of  labeled  data 
Va,  for  which  the  goal  was  to  learn  relationships  with  a  primary  set  of  (unlabeled  or  partially 
labeled)  data  Vp.  In  some  problems  we  may  have  M  —  1  existing  labeled  data  sets,  indexed  by 
m  =  1,  2,  •  •  •  ,  M  —  1,  and  we  are  interested  in  learning  the  characteristics  (concept)  of  a  new 
(M-th)  unlabeled  or  partially  labeled  data  set.  Using  the  method  presented  here,  all  data  in  the 
existing  M  —  1  data  sets  would  be  combined  to  define  V"  .  There  is  a  question  as  to  whether  this 
is  the  most  effective  way  to  address  this  problem.  A  related  technique  for  handling  multiple 
data  sets  is  based  on  the  notion  of  multitask  learning  (MTL)  or  inductive  transfer  (Baxter  2000; 
Caruana  1997;  Yu  et  al.  2005).  Here  a  task  refers  to  classifier  design  based  on  a  specific  data  set. 
The  goal  in  multitask  learning  is  to  enhance  training  examples  used  to  learn  a  given  task  by 
borrowing  information  from  related  tasks.  Information  borrowing  is  accomplished  by  learning 
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the  multiple  tasks  simultaneously  under  a  unified  framework.  This  is  particularly  beneficial 
when  each  task  has  limited  training  examples,  since  information  borrowing  allows  examples 
of  related  tasks  to  be  utilized  when  learning  the  target  task.  Note  MTL  does  not  require  the 
tasks  to  be  ordered  in  time;  it  only  assumes  that  the  tasks  are  related  in  some  manner. 

Existing  MTL  algorithms  are  distinguished  by  the  way  information  borrowing  is  imple¬ 
mented.  In  a  neural  network,  in  which  each  output  node  can  encode  a  task  (Bakker  and  Heskes 
2003;  Caruana  1997;  Liao  and  Carin  2006),  information  borrowing  is  implemented  by  a  com¬ 
mon  internal  representations  such  as  hidden  nodes  and  input-to-hidden  weights.  The  method 
in  (Evgeniou  et  al.  2005)  employs  a  task-kernel  to  capture  the  similarity  between  any  two  tasks. 
The  task-kernel  is  used  to  construct  a  quadratic  regularization  term  for  the  parameters  across 
all  tasks,  which  implements  information  borrowing  from  one  task  to  another.  In  Bayesian  hi¬ 
erarchical  models  (Dominici  et  al.  1997),  a  common  prior  distribution  is  placed  over  the  model 
parameters  in  different  tasks  to  represent  the  information  shared  between  tasks.  In  nonpara- 
metric  Bayesian  models  (Xue  et  al.  2007),  information  borrowing  is  carried  out  by  by  a  common 
Dirichlet  process  (DP)  (Lerguson  1973)  employed  to  generate  the  nonparametric  prior  distribu¬ 
tion  over  the  model  parameter  in  each  task.  The  method  in  (Wu  and  Dietterich  2004)  learns  a 
classifier  based  on  a  weighting  of  two  tasks,  with  the  auxiliary  task  given  lower  weight  to  re¬ 
flect  that  it  has  a  discounted  contribution  to  the  classifier  learning.  Here  the  target  task  borrows 
information  from  the  auxiliary  one,  through  the  discounted  contribution. 

An  interesting  direction  for  future  research  involves  examination  of  the  relationship  be¬ 
tween  the  concept-drift  algorithm  presented  here  and  the  aforementioned  MTL  approaches, 
each  of  which  constitutes  a  method  for  implementing  transfer  learning.  In  particular,  it  is  of 
interest  to  examine  the  value  in  retaining  the  separation  of  the  M  —  1  labeled  data  sets,  as  in 
MTL,  versus  aggregating  them  to  define  Va.  Lor  the  UXO  problem  considered  as  a  practical 
problem  in  this  paper,  one  may  have  cleaned  M  —  1  previous  sites  before  considering  the  M-th 
(although  this  was  not  the  case  in  the  example  considered  here,  in  which  we  only  had  labeled 
data  from  one  previous  site).  This  line  of  investigation  will  be  the  focus  of  a  future  study. 
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Appendix 

Proof  of  Theorem  1:  Let  f(z)  be  the  first  derivative  of  f(z).  We  have  Ylf=i  /(&,:  +  zf  = 
EiLi/M  +  Eii/o*  f(x  +  bfjdx.  The  first  term  on  the  right  side  is  a  constant  and  hence, 
the  problem  in  (13)  is  equivalent  to 


N  rzi 

max{zj}  V'  /  f\bi  +  x)dx  (A-l) 

*= i  Jo 

Because  f"(z)  <  0,  we  have  for  any  T\  <  r2  that  /'(ti  +  x)  >  /'(t2  +  x)  and  consequently 

f0A  fin  +  x)dx>  f0A  f(j2  +  x)dx  (A_2) 

V  T\  <  r2  and  A  >  0 

By  (12),  there  exists  0  <  r  <  n(bn+ 1  -  bn)  such  that  R  =  nbn  -  Y2=i  ^  +  r  =  YjI=i 
where  Ak  =  bk+ 1  —  bk  for  k  =  1,  •  •  •  ,n  —  1,  and  An  =  rjn.  We  now  use  (A-2)  to  distribute 
Ai,  2A2,  •  •  •  ,  nAn  to  zi,  z2,  ■  ■  ■  ,  zn  such  that  the  resulting  {zi}  maximize  (A-l).  As  Ak  >  0  for 
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k  =  1,  -  ■  ■  ,  n,  and  any  distribution  of  {kAk}%=1  to  {zk}k=1  makes  J2iLi  A  —  J2k= 1  =  the 

constraints  of  (14)  and  (15)  are  automatically  satisfied. 

Initially  z%  =  0  for  i  —  1,  2,  •  •  •  ,  N. 

As  Ai  =  b2  —  bi  >  0,  by  (A-2),  fAl  f'(bi  +  x)dx  >  fAl  f'(b2  +  x)dx,  therefore  Ax  is  distributed 
to  z\,  i.e.,  z\  z\  +  Ax,  which  makes  bi  +  z\  —  b2. 

Similarly  A2  =  b3  —  b2  >  0,  by  (A-2),  fA'2  f(b2  +  x)dx  >  fA'2  f'{b3  +  x)dx,  therefore  2A2  is 
equally  distributed  to  z\  and  z2r  i.e.,  z\  zx  +  A2  and  z2  ^2  +  A2,  which  makes  b\  +  z\  = 

b2  +  z2  =  63. 

Generally,  A;Afc  is  equally  distributed  to  zi,z2,  ■  ■  ■  ,  zk.  After  the  distribution  of  kAk,  k  = 
1, 2,  •  •  •  ,  n,  we  have  zk  =  Y17=k  f°r  ^  =  1,2,---  ,  n  and  =  0  for  k  =  n  +  1,  n  +  2,  •  •  •  ,  A, 
which  is  equal  to  the  solution  in  (16).  Because  the  problem  is  strictly  concave,  the  solution  is 
unique  and  globally  optimal.  □ 

Derivation  of  Equation  (19):  By  definition  of  logistic  regression,  w  is  the  parameter  of  the 
conditional  distribution  Pr(y  |  x)  =  cr(yw7  x),  with  x  given  and  fixed.  Letg  =  <91rux(f/wTx)/<9w  = 
[1  -  a(ywTx)]yx.  Then  Ey(ggr)  =  52y=-i,i  a(vwT x)[l  -  u(r/wTx)]2xxT.  Using  a(-wTx)  = 
1  —  cr(wTx),  we  obtain  Ey(gg7  )  =  cr(wTx)[l  —  cr(wTx)]xxT .  Summing  E(ggT)  over  all  primary 
and  auxiliary  data  points  (assuming  the  data  are  independent),  we  obtain  the  formula  of  Q.  □ 
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Abstract — Semi-supervised  learning  and  active  learning  are 
considered  for  UXO  detection.  Semi-supervised  learning  algo¬ 
rithms  are  designed  using  both  labeled  and  unlabeled  data,  where 
here  labeled  data  corresponds  to  sensor  signatures  for  which 
the  identity  of  the  buried  item  (UXO/non-UXO)  is  known;  for 
unlabeled  data  one  only  has  access  to  the  corresponding  sensor 
data.  Active  learning  is  used  to  define  which  unlabeled  signatures 
would  be  most  informative  to  improve  classifier  design,  if  the 
associated  label  could  be  acquired  (where  for  UXO  sensing  the 
label  is  acquired  by  excavation).  A  graph-based  semi-supervised 
algorithm  is  proposed,  employing  the  idea  of  a  random  Markov 
walk  on  a  graph,  thereby  exploiting  knowledge  of  the  data 
manifold  (where  the  manifold  is  defined  by  both  the  labeled 
and  unlabeled  data).  The  resulting  algorithm  is  then  used  to 
infer  labels  for  the  unlabeled  data,  providing  a  probability  that 
a  given  unlabeled  signature  corresponds  to  a  buried  UXO.  An 
efficient  active-learning  procedure  is  developed  for  this  algorithm, 
based  on  a  mutual-information  measure.  In  this  manner  one 
initially  performs  excavation  with  the  purpose  of  acquiring  labels 
to  improve  the  classifier,  and  once  this  active-learning  phase  is 
completed  the  resulting  semi-supervised  classifier  is  then  applied 
to  the  remaining  unlabeled  signatures,  to  quantify  the  probability 
that  each  such  item  is  UXO.  Example  classification  results  are 
presented  for  an  actual  UXO  site,  based  on  electromagnetic 
induction  and  magnetometer  data.  Performance  is  assessed  in 
comparison  to  other  semi-supervised  approaches,  as  well  as  to 
supervised  algorithms. 

I.  Introduction 

Unexploded  ordnance  (UXO)  correspond  to  explosive  de¬ 
vices  ( e.g .,  bombs)  that  did  not  explode  upon  impact  with  the 
ground,  and  that  are  subsequently  buried  intact  or  partially 
intact.  Some  UXO  may  also  exist  on  the  surface  of  the 
ground,  but  we  here  assume  these  are  removed  via  manual 
inspection,  and  therefore  this  paper  focuses  on  detecting  buried 
UXO.  There  are  several  sensing  techniques  that  have  been 
developed  over  the  last  several  decades  for  detection  of  buried 
UXO.  Most  widely  used  among  these  are  electromagnetic 
induction  (EMI)  [1,  2,  3]  and  magnetometers  [4].  Both  of 
these  approaches  are  based  on  sensing  magnetic  signatures. 
An  EMI  sensor  is  an  active  approach,  whereby  electromagnetic 
radiation  is  emitted,  and  one  measures  the  signals  scattered  off 
targets.  Such  that  one  achieves  sufficient  ground  penetration, 
EMI  systems  are  typically  designed  to  operate  at  kilohertz  fre¬ 
quencies  (in  the  inductive  regime).  By  contrast,  magnetometers 
are  passive  sensors,  which  measure  the  static  magnetic  field 
of  the  earth,  and  hence  the  presence  of  ferrous  targets,  which 
yield  a  corresponding  perturbation  to  the  earth’s  magnetic 
field. 


A  UXO  is  any  explosive  device  that  has  not  detonated, 
and  therefore  UXO  are  dangerous  if  disturbed.  However, 
although  the  device  didn’t  detonate,  in  many  cases  the  item 
is  deformed  upon  impact,  with  possible  components  {e.g.,  tail 
wings)  broken  off.  In  addition,  there  are  many  different  types 
of  explosives  (bombs)  that  may  have  been  deployed.  These 
factors  significantly  complicate  one’s  ability  to  distinguish 
UXO  from  non-UXO  based  on  the  EMI  and/or  magnetometer 
signature.  Specifically,  many  types  of  buried  benign  metal 
items  are  often  readily  confused  for  UXO,  based  on  the 
sensor  signature.  Consequently,  the  unnecessary  excavation  of 
non-UXO  items  often  constitutes  the  principal  cost  of  UXO 
cleanup  (there  are  typically  far  more  non-UXO  buried  metal 
items  than  there  are  actual  UXO).  Therefore,  classification 
of  UXO  constitutes  a  significant  sensing  and  classification 
challenge. 

Classification  using  EMI  and  magnetometer  sensors  is  typ¬ 
ically  not  performed  directly  on  the  measured  data,  but  on 
features  extracted  therefrom.  Specifically,  parametric  models 
have  been  developed  for  the  response  of  targets  as  viewed 
from  such  sensors,  with  most  of  these  models  based  on  a 
dipole  approximation  [3].  The  parameters  extracted  from  the 
models,  when  fitting  is  performed  to  the  measured  data,  are 
typically  employed  to  constitute  feature  vectors  within  the 
subsequent  classification  algorithm.  Most  of  these  algorithms 
are  supervised,  in  the  following  sense.  A  set  of  labeled 
feature  vectors  are  assumed  given  (the  identity,  UXO/non- 
UXO,  of  each  feature  vector  is  known),  and  these  data  are 
used  to  design  a  classifier.  Numerous  such  classifiers  have 
been  considered  for  UXO  detection,  such  as  kernel  matching 
pursuits  [5],  support  vector  machines  [3],  and  likelihood-ratio 
tests  [3],  There  are  two  limitations  of  such  approaches:  (i)  the 
assumption  of  the  presence  of  an  appropriate  labeled  data  set 
is  tenuous  in  many  cases,  and  (ii)  even  when  such  labeled  data 
are  available,  a  purely  supervised  algorithm  doesn’t  exploit  the 
contextual  information  provided  by  the  unlabeled  data. 

General  interest  in  these  latter  two  issues  has  motivated  re¬ 
cent  research  in  the  machine  learning  community  Specifically, 
active  learning  [6,  7]  is  a  framework  whereby  the  acquisition 
of  labeled  data  is  integrated  within  classifier  design.  Using 
appropriate  information-theoretic  measures,  an  active-learning 
algorithm  asks  which  of  the  unlabeled  feature  vectors  would  be 
most  informative  for  classifier  design  if  the  associated  labels 
could  be  made  available.  This  idea  has  been  applied  previously 
in  the  context  of  UXO  detection  [5],  The  new  aspect  of  the 
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work  considered  here  is  that  this  active-learning  framework  is 
placed  within  the  context  of  a  semi-supervised  learning  setting. 
Specifically,  in  addition  to  actively  acquiring  the  labeled 
data  (performing  item  excavations  selectively  for  the  purpose 
of  algorithm  learning),  a  semi-supervised  algorithm  exploits 
contextual  information  provided  by  all  of  the  unlabeled  data 
(the  classification  of  any  one  unlabeled  feature  vector  is  placed 
within  the  context  of  all  unlabeled  feature  vectors).  In  the  UXO 
problem  the  EMI/magnetometer  data  are  often  all  collected 
at  once,  typically  using  a  cart-based  system  [4];  recall  that 
the  UXO  of  interest  are  all  buried,  and  therefore  they  are 
not  dangerous  until  excavation  begins.  Therefore,  one  may 
perform  feature  extraction  on  all  of  the  signatures  at  once, 
and  the  contextual  information  provided  by  these  data  may  be 
of  utility  in  improving  classification  performance. 

Semi-supervised  learning  has  been  an  area  of  significant 
recent  interest  in  the  machine-learning  community  [8,  9,  10, 
11,  12,  13,  14,  15],  where  exploitation  of  the  information 
available  in  the  unlabeled  data  has  been  demonstrated  to  often 
add  value.  To  our  knowledge  this  paper  represents  the  first 
use  of  such  an  approach  as  applied  to  the  UXO  problem.  As 
discussed  below,  the  semi-supervised  approach  proposed  here 
is  new  in  its  own  right,  and  has  advantages  relative  to  other 
such  techniques  currently  in  the  literature. 

To  date,  there  have  been  several  semi-supervised  meth¬ 
ods  developed.  The  generative-model  method,  an  early  semi- 
supervised  method,  estimates  the  joint  probability  of  data 
and  labels  via  expectation-maximization  (EM),  treating  the 
missing  labels  of  unlabeled  data  as  hidden  variables;  this 
method  was  studied  in  statistics  for  mixture  estimation  [16] 
and  has  been  reformulated  for  semi-supervised  classification 
[15],  Co-training  [13],  another  early  method,  exploits  two 
independent  subvectors  of  features,  using  one  to  provide 
the  label  estimates  for  the  other;  co-training  has  received 
renewed  interest  recently,  particularly  theoretically.  The  semi- 
supervised  support  vector  machine  (SVM)  [12]  represents  a 
more  recent  method,  which  maximizes  the  margin  between 
classes,  taking  into  account  both  labeled  and  unlabeled  data. 
Graph-based  methods  [11,  10,  14,  9],  the  main  focus  of  current 
research  in  semi-supervised  learning,  exploits  the  assumption 
that  strongly  connected  data  points  (in  feature  space)  should 
share  the  same  label,  and  utilizes  spectral  graph  theory  to 
quantify  the  between-data  connectivity.  For  a  more  complete 
review  of  the  literature,  see  [8]. 

Most  graph-based  algorithms  operate  in  a  transductive 
fashion,  i.e.,  they  directly  learn  the  labels  of  the  unlabeled 
data,  instead  of  learning  a  classifier  first  and  then  using  the 
classifier  to  infer  the  unseen  labels  (inductive  learning).  While 
transductive  algorithms  avoid  the  problem  of  model  selection 
for  a  classifier,  they  lack  a  principled  way  of  predicting  the 
labels  of  data  out  of  the  training  set.  The  work  in  [9]  addresses 
this  problem  by  constructing  a  graph-based  prior  distribution 
on  the  parameters  of  a  classifier  and  learns  the  classifier 
by  maximizing  the  posterior  (MAP  estimation);  the  prior 
utilizes  both  labeled  and  unlabeled  data,  thus  enforcing  semi- 
supervised  learning.  Several  drawbacks  are  inherent  in  the 
algorithm  in  [9].  For  example,  the  hyper-parameter  balancing 
the  importance  of  the  prior  relative  to  the  data  likelihood  needs 


to  be  learned. 

In  this  paper,  with  a  focus  on  the  UXO-sensing  application, 
we  present  an  algorithm  for  learning  parametric  classifiers  on 
a  partially  labeled  data  manifold,  by  representing  the  manifold 
as  a  graph;  each  vertex  on  the  graph  represents  a  data  point 
and  the  weighted  edge  between  two  vertices  manifests  the 
immediate  connectivity  between  the  corresponding  data  points. 
We  are  motivated  by  the  work  in  [11]  and  build  the  (-step 
connectivity  between  data  points  via  a  Markov  random  walk 
on  a  manifold.  To  account  for  heterogeneities  in  the  data 
manifold,  we  let  the  random  walk  take  different  step-sizes 
at  different  data  locations;  each  step-size  dictates  a  Markov 
transition  matrix  and  we  select  the  step-size  to  assemble  the 
transition  matrix  for  the  entire  manifold. 

The  remainder  of  the  paper  is  organized  as  follows.  In  Sec¬ 
tions  II  and  III  we,  respectively,  discuss  the  semi-supervised 
learning  algorithm  and  the  active-learning  framework.  These 
discussions  are  presented  in  a  general  setting,  applicable  to 
any  remote-sensing  problem  for  which  (i)  all  of  the  unla¬ 
beled  data  are  available  simultaneously,  and  (ii)  there  is  an 
opportunity  to  selectively  acquire  labels  on  a  subset  of  the 
unlabeled  feature  vectors.  In  the  work  considered  here  these 
labels  may  be  acquired  via  selective  excavation,  while  in 
other  settings  one  may  employ  a  human  analyst  or  potentially 
another  (higher-resolution)  sensor,  selectively  deployed.  The 
specific  application  to  UXO  sensing  is  discussed  in  Sec. 
IV,  wherein  the  sensors  and  feature  vectors  considered  are 
described.  Results  are  presented  for  an  actual  UXO  site,  using 
magnetometer  and/or  EMI  sensor  data,  and  comparisons  are 
made  to  other  approaches  (other  classes  of  supervised  and 
semi-supervised  algorithms).  Conclusions  from  this  work  are 
provided  in  Section  V. 

II.  The  Semi-supervised  Learning  Algorithm 

A.  The  Graph  Representation  of  a  Partially  Labeled  Data 
Manifold 

Let  G  =  (A\W)  be  a  graph,  where  X  =  {xi,  X2,  •••, 
xjv}  is  the  set  of  vertices  and  W  =  [xifwxN  is  the  affinity 
matrix  with  the  (i,j)-th  element  indicating  the  strength 
of  immediate  connectivity  between  vertices  x,  and  x(.  For 
the  purpose  of  data  classification,  the  vertex  set  X  coincides 
with  the  set  of  data  points  (labeled  or  unlabeled),  and  Wjj  is  a 
quantitative  measure  of  the  closeness  of  data  points  x,;  and  x;. 
In  the  semi-supervised  setting,  only  a  subset  of  X  are  provided 
with  class  labels,  and  the  remaining  data  points  are  unlabeled, 
and  therefore  we  have  a  partially  labeled  graph. 

Although  there  are  many  alternative  ways  of  defining  the 
connectivity  ay:l ,  here  we  consider  a  radial  basis  function 

=  exp(-^X*2J^  )  (1) 

where  ||  •  j|  represents  the  Euclidean  norm.  While  the  affinity 
matrix  may  provide  a  reasonable  local  similarity  among  the 
data  points,  it  is  not  a  good  representation  of  the  global  simi¬ 
larity  measure  of  the  data  sets.  Following  [11],  we  construct  a 
Markov  random  walk  based  on  the  affinity  measure,  which  is 
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capable  of  incorporating  both  the  high-density  clustering  prop¬ 
erty  and  the  manifold  structure  of  the  data  set.  Specifically,  we 
induce  a  Markov  transition  matrix  A  =  [a,j];vXN,  where  the 
(f,  j)-th  element 


aij 


Wik 


(2) 


gives  the  probability  of  walking  from  x,;  to  Xj  by  taking  a 
single  step.  In  general  we  are  interested  in  a  /-step  random 
walk,  the  transition  matrix  of  which  is  given  by  A  raised  to  the 
power  of  t,  i.e.,  A'  —  [a-^jvxtv-  The  Af  is  row  stochastic, 
where  each  element  af  represents  the  probability  that  the 
Markov  process  starts  front  x,  and  ends  at  x(  by  taking  f- 
step  random  walks.  As  a  special  case.  A'  degenerates  to  an 
identity  matrix  when  t  =  0,  which  means  one  can  only  stay 
at  a  single  data  point  when  no  walk  is  performed. 

In  specifying  the  Markov  transition  matrix  in  (1)  we  have 
used  a  distinct  a  for  each  data  point  x.  In  the  random  walk, 
a  can  be  thought  of  as  the  step-size.  Therefore  location- 
dependent  step-sizes  allow  one  to  account  for  possible  het¬ 
erogeneities  in  the  data  manifold  —  at  locations  where  data 
are  densely  distributed  a  small  step-size  is  enough,  whereas  at 
locations  where  data  are  sparsely  distributed  a  large  step-size 
is  necessary  to  connect  a  data  point  to  its  nearest  neighbor. 
A  simple  choice  of  the  heterogeneous  a  is  to  let  <Ji  to  be 
a  fraction  of  the  shortest  Euclidean  distance  between  x,;  and 
all  other  data  points  in  X.  This  ensures  each  data  point  is 
immediately  connected  to  at  least  one  neighbor. 


B.  Neighborhood-Based  Learning 

Any  two  data  points  x,  and  Xj  are  said  to  be  /-step 
neighbors,  denoted  as  xj  ~  x^,  if  af  >  0.  Then  J\ft(xi)  = 

{x  :  x  ~  Xj}  C  X,  which  represents  the  set  of  f-step 
neighbors  of  x,;,  is  called  the  f-step  neighborhood  of  x,. 
When  f  =  0,  the  neighborhood  shrinks  to  a  single  data  point, 
A/o(x.j)  =  {x;}.  We  define  the  probability  of  label  yi  given 
the  f-step  neighborhood  of  x,  as 

N 

p(yi\Aft(xi),0)  =  Y  aij  P(yi\xj’e)  (3) 
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where  the  magnitude  of  af  automatically  determines  the 
contribution  of  Xj  to  the  neighborhood,  thus  we  are  allowed 
to  run  the  index  j  over  the  entire  X.  Expression  p(yi\xj,0) 
is  the  probability  of  label  yi  given  a  single  data  point  xy 
(zero-step  neighborhood)  and  it’s  represented  by  a  standard 
probabilistic  classifier  parameterize  by  6.  In  this  paper  we 
consider  binary  classification  with  y  G  {—1,1},  and  choose 
the  form  of  p(yi\xi,9)  as  logistic  regression  classifier 


p(yi\xj,6) 


1 

1  +  exp  (-yi6TXj) 


(4) 


where  we  assume  a  constant  element  1  is  prefixed  to  each 
feature  vector  x  (the  prefixed  x  is  still  denoted  as  x  for 
notational  simplicity),  thus  the  first  element  in  d  is  a  bias 
term.  Arbitrarily  one  may  set  y  =  1  as  corresponding  to  a 
UXO,  and  y  =  —  1  as  corresponding  to  a  non-UXO. 


We  distinguish  between  the  classifier  in  (3)  and  the  typical 
logistic  regression  classifier 


p(yi\xi,0) 


1 

1  +  exp  (-yI0Txi) 


(5) 


The  fundamental  difference  between  these  two  is  that  the 
logistic -regression  classifier  predicts  yi  using  x,  alone,  while 
the  semi-supervised  approach  considered  here  predicts  yi  by 
x,  and  the  feature  vectors  in  the  neighborhood  of  x,;.  The 
neighborhood  of  x,  is  formed  by  all  xy’s  that  can  be  reached 
from  X;  by  f-step  random  walks,  with  each  x.y  contributing 
to  the  prediction  of  yi  in  proportion  to  af,  the  probability  of 
walking  from  x,  to  xy  in  f  steps.  The  role  of  neighborhoods 
is  then  conspicuous  —  in  order  for  x,  to  be  labeled  yt,  each 
neighbor  Xj  must  be  labeled  consistently  with  y%,  in  the  degree 
proportional  to  (i  f-  ;  in  such  a  manner,  y,  implicitly  propagates 
over  the  neighborhood.  By  taking  the  neighborhoods  into 
account,  it  is  possible  to  learn  a  classifier  with  only  a  few 
labels  present  and  yet  the  classifier  learned  is  much  less  subject 
to  over-fitting  than  when  ignoring  the  neighborhoods.  This  is 
addressed  in  greater  detail  below. 


Let  C  C  {1, 2,  •  •  •  ,  N}  denote  the  set  of  indices  of  labeled 
data.  Assuming  the  labels  are  conditionally  independent,  we 
obtain  the  likelihood  function 

p({yi,i  G  £}|{A/'t(xi)  :  i  G  £},  9)  =  JJp(yi|A/}(xi),0) 

iec 

N 

=nYaijp(yi  ixJ’0)  (6) 

ie£  j=i 

which  is  the  joint  probability  of  observed  labels  given  the  f- 
step  neighborhood  of  each  corresponding  data  point.  Estima¬ 
tion  of  9  may  be  achieved  by  maximizing  the  log-likelihood, 
which  however  may  yield  over-fitting,  especially  when  the 
number  of  labeled  samples  is  small.  To  enforce  sparseness 
of  6  (sparseness  has  been  demonstrated  as  an  important 
property  [17],  discouraging  overfitting),  we  impose  a  zero- 
mean  Gaussian  prior  on  each  dimension  of  Q , 

P(S|A)=  (2^eXp(-5"'Ae)  <7) 

where  A  =  diag{ Ai,  A2, ...,  A^}  are  hyper-parameters,  d  is  the 
dimensionality  of  x.  Each  hyper-parameter  has  an  independent 
Gamma  distribution,  resulting  in 

d 

p(A\a,P)  =  n  Gamma(Ai|aj,  Pi) 

i= 1 

=  n  (8) 

i= 1 [ai> 

Marginalizing  A,  we  obtain  the  prior  distribution  conditional 
directly  on  a  and  ft, 

p{9\a,  ft)  =  J p(6\A)p(A\a,  ft)  dA  (9) 

The  posterior  of  6  follows  from  (6)  and  (9), 

p(d\a,ft,{yi,Nft(xi)  :  i  G  £}) 
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N 

=  Z  1 II  aU  e) 

iec  j=i 

where  Z  is  a  normalization  constant.  We  are  interested  in  the 
maximum  a  posterior  (MAP)  estimate  of  0,  which  maximizes 
(10)  or,  equivalently, 

0 )  d='  lnp(9\a,/3,{yi,N't(xi)  :  i  e  C}) +  lnZ 

N 

i= l 

+  ln  J  p(0\A)p(A\a,  (3)  dA  (11) 

The  0  obtained  by  maximization  of  £(0)  generally  is  not 
subject  to  over-fitting  due  to  two  reasons  —  the  neighborhoods 
incorporated  into  the  first  term  of  t(9)  encourages  smoothness 
along  the  manifold,  and  the  second  term  of  1(6)  enforces 
sparseness  of  6. 


j p(G\A)p(A\a,  p)  dA(10) 


C.  The  Learning  Algorithm 


We  maximize  (11)  by  employing  an  expectation- 
maximization  (EM)  algorithm.  For  any  {<5y  :  Sij  > 

0.  Sij  =  1}  and  {q( A)  :G  q(A)dA  =  1},  we  apply 
Jensen’s  inequality  to  the  righthand  side  of  (11)  to  obtain  the 
lower  bound 


m  >  qw,  ?)  “‘J-  E  E  j«  in 

+  /»( A)lnP(8|A)(iA|“,ftdA  (12) 


9(A) 


where  the  equality  holds  when 

s  _  P(yi\xj,0)afj 

T,k=iP(yi\xk,0)a^ 


(13) 


=  p(6\A)p(A\a,P) 

q[  ’  Jp(9\A)p(A\a,P)dA 


(14) 


The  EM  algorithm  consists  of  iteration  of  the  following  two 
steps. 


1)  E-step:  computing  {8l:j}  and  <;(A)  using  (13)  and  (14); 

2)  M-step:  compute  the  re-estimate  of  0  as 

9  =  argmaxQ(0|(i,  q)  (15) 

9 

The  convergence  is  monitored  by  checking  £(G),  which  is 
guaranteed  to  mono  tonic  ally  increase  over  the  EM  iterations. 

There  are  two  noticeable  points  regarding  the  technical 
details.  First,  since  (8)  is  conjugate  to  (7),  q( A)  is  of  the  same 
form  as  (8)  with  updated  hyper-parameters  a,  P, 


<z(A) 


d  1  1 

'[I  Gamma(Ai|ai  +  Pi  +  -0}) 


i=  1 
d 

n 

i= 1 


(Pi 


r(«i  +  i) 


A“‘  5e"A’(ft+|e,-)  (16) 


and  the  integral  in  the  dominator  of  (14)  has  an  analytic  form 


J  p(6\A)p(A\a,  P)dA 

=  1  p?  r(qi+4) 

{27ryi/-2  r(ai)  (pi+i02 


(17) 


which  is  useful  in  checking  the  convergence  of  1(G)  in  (11). 

Secondly,  in  computing  Q(G\S,q)  by  (12),  one  needs  to 
compute  7 (9)  d='  f  q(A)  lnp(0|A)dA,  and  it  is  found  that 

7(0)  =  -±9TEq(A\9)9 

=  — j)0Tdiag  pEg(Ai),Eq(A2),  •  •  •  ,E9(Ad)]0  (18) 

with 


Eq(Ai) 


Oii  +  | 

h  +  ¥i' 


(19) 


III.  Active  Learning 

In  the  UXO-classification  problem,  it  is  a  given  that  exca¬ 
vation  will  ultimately  be  performed.  The  principal  objective  is 
to  excavate  as  high  a  percentage  of  UXO  as  possible,  while 
leaving  as  much  of  the  non-UXO  as  possible  unexcavated. 
Recall  that  the  primary  expense  in  UXO  cleanup  is  the 
excavation  of  non-UXO  items,  since  the  density  of  such  is 
typically  much  higher  than  the  amount  of  UXO,  and  the 
sensor  signatures  of  UXO  are  often  very  similar  to  those 
of  many  types  of  non-UXO.  Given  that  excavation  will  be 
performed  in  any  case,  one  may  ask  whether  the  initial  set  of 
excavations  may  be  performed  with  the  purpose  of  improving 
the  performance  of  the  algorithm.  Specifically,  one  may  ask 
which  unlabeled  sensor  signature  would  be  most  informative 
to  improved  classifier  performance  if  the  associated  label 
could  be  made  available.  As  discussed  below,  this  question 
is  answered  in  a  quantitative  information-theoretic  manner. 
When  the  expected  information  content  of  such  an  excavation 
drops  below  a  prescribed  threshold,  excavation  for  the  purpose 
of  improved  learning  is  terminated,  and  then  the  algorithm 
is  used  to  define  the  probability  that  all  remaining  unlabeled 
signatures  correspond  to  UXO.  Importantly,  in  active  learning 
the  algorithm  desires  to  learn  about  the  properties  of  the  UXO 
and  non-UXO  at  the  site,  and  therefore  in  this  phase  an 
excavated  non-UXO  should  not  be  termed  a  “false  alarm”. 
Such  active  learning  has  been  performed  previously  in  a 
related  UXO-cleanup  study  [5];  the  distinct  character  of  the 
algorithm  discussed  below  is  that  this  process  is  here  placed 
within  the  context  of  semi-supervised  learning. 


A.  Active  Learning  with  Semi-Supervised  Classifier 

For  active  label  selection,  we  consider  a  Gaussian  approxi¬ 
mation  of  the  posterior  of  the  classifier 

p(9\D)  ~  Af(0\0,  H_1)  (20) 

where  9  is  the  estimate  of  the  classifier  learned  from  the  above 
EM  algorithm,  and  H  is  the  posterior  precision  matrix  H  = 
V2(-logp(0|{yJ:,A/'t(xi)  :  i  G  £}).  By  treating  7(0)  in  (18) 
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as  deterministic,  we  obtain  an  evidence-type  approximation 

[17]: 

N 

H  =  ^2^2sijp(yi\xj,d)(l-p(yi\xj,9))y:jxJ 

iec  j=i 

-V2  In  7(0)  (21) 

With  one  more  data  point  x,:*  with  label  yrs  as  the  next  labeled 
data,  assuming  that  the  MAP  estimate  of  9  remains  the  same 
after  including  the  new  data  point,  then  the  posterior  precision 
changes  to 

N 

h'=  y 

i'G-CU-fi*}  j= 1 

-V2  In  7(0)  (22) 


For  active  label  selection,  we  could  further  simplify  the  equa¬ 
tion  for  the  precision  matrix  by  considering  the  degenerated 
connectivity  matrix  A^t=0\  which  is  an  identity  matrix,  such 
that 
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1 ,  for  i  =  j 
0,  for  i  ±  j 


Following  this,  the  new  precision  matrix  becomes 

H'  =  H+p(t/«|x«,0)(l  -p(yi*|x«,0))xi*x^ 


(23) 

(24) 


Our  criterion  for  active  learning  is  to  choose  the  feature  vector 
for  labeling  that  maximizes  the  mutual  information  between 
the  classifier  9  and  the  new  data  point  to  be  labeled,  which 
is  the  expected  decrease  of  the  entropy  of  9  after  x,„  and  yt,, 
are  observed. 


1  = 


1 
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|H| 


l0g{l+p(t/i*|x2;* 


0)[l-p(yi*  |xi*,0)]x£tf  ^i*}  (25) 


The  mutual  information  I  is  large  when  p(j/i*|xj*,  9)  «  0.5, 
therefore,  our  active  learning  prefers  label  acquisition  on 
samples  with  uncertain  classification,  based  on  the  current 
classifier  based  upon  available  labeled  data.  Further,  consid¬ 
ering  the  term  x^7T-1Xi*,  the  mutual  information  criterion 
prefers  samples  with  high  variance. 

The  assumption  that  the  mode  of  the  posterior  distribution 
of  the  classifier  remains  unchanged  with  one  more  labeled 
data  point  is  not  good  at  the  beginning  of  the  active  learning 
procedure.  However,  empirically  we  have  found  that  it  is  a 
very  good  approximation  after  the  active  learning  procedure 
has  acquired  as  few  as  15  labels,  for  the  examples  considered 
here.  Further,  this  assumption  obviates  the  need  to  re -train 
the  classifier  after  each  new  label  is  acquired,  thus  saving 
computational  cost. 


B.  Other  Active-Learning  Approaches 

When  presenting  results,  we  will  make  comparisons  to 
other  semi-supervised  learning  algorithms,  and  therefore  we 
briefly  discuss  how  active  learning  is  implemented  in  these 
approaches.  In  the  semi-supervised  work  of  [9],  the  authors 
also  proposed  a  scheme  for  active  label  acquisition,  which 


reduces  to  the  same  criterion  as  (25).  However,  our  classifier 
is  different  from  theirs,  and  consequently  active  learning 
based  on  our  classifier  will  yield  different  results  from  those 
produced  by  the  algorithm  in  [9]. 

As  pointed  out  in  the  Introduction,  our  work  was  motivated 
by  that  in  [11],  where  a  similar  Markov  random  walk  graph  is 
defined.  Instead  of  training  a  parametric  inductive  classifier  as 
in  (3),  the  EM  algorithm  in  [11]  learns  the  classification  proba¬ 
bility  of  the  unlabeled  data  directly  (transductive).  Specifically, 
the  probability  of  the  label  for  x,  is  defined  as 

N 

p(yi\M(*i))  =  Y  aif  p(y*  Ixj)  (26) 

j=l 

and  the  estimation  criterion  is  to  maximize  the  log-likelihood 
of  the  labeled  data  points, 

N 

^1°gp({t/i|A/'t(Xi))  =  Yl°gYaij  p(yi\x^  (27) 

i£C  j  —  1 

which  may  be  performed  via  an  EM  method.  There  is  no  active 
learning  algorithm  provided  in  [11],  Here  we  therefore  propose 
an  active-learning  criterion  for  the  non-parameteric  classifier 
in  [11],  which  selects  Xj*  to  be  the  next  labeled  data  point  by 
maximizing 

P(yi*|xj*)(l  -p(2/**lxi*))  (28) 

In  this  approach  we  simply  acquire  a  label  on  that  unlabeled 
sample  for  which  the  current  classifier  is  most  uncertain; 
clearly  this  simple  approach  may  be  applied  to  any  classifier. 

IV.  Application  to  UXO  Detection 
A.  Magnetometer  and  EMI  Sensor  Data  Considered 

To  evaluate  the  proposed  algorithm,  we  applied  it  to  a  UXO 
data  set  from  an  actual  former  bombing  range.  This  data  set 
was  collected  by  the  Multi-sensor  Towed  Array  Detection 
System  (MTADS)  [4],  This  system  is  composed  of  arrays 
of  full-field  cesium  vapor  magnetometers  and  time-domain 
electromagnetic  pulsed  induction  sensors.  The  magnetometers 
were  Geometries  Model  822ROV,  while  the  EMI  sensors 
were  highly-modified  Geonics  EM-61  sensors.  The  data  were 
collected  at  a  bombing  target  on  the  Badlands  Bombing  Range 
on  the  Ogala  Sioux  Reservation  in  Pine  Ridge,  South  Dakota. 
The  UXO  items  present  at  the  site  included  M  38  (100  lb.) 
sand-filled  practice  bombs,  M  57  (250  lb.)  practice  bombs, 
2.25  in.  and  2.75  in.  rocket  bodies  and  rocket  warheads,  and 
ordnance  scrap  (such  as  tail  fins  and  casing  parts).  The  details 
of  how  these  measured  data  are  analyzed  with  magnetometer 
and  EMI  dipole  models  has  been  described  in  detail  elsewhere 
[3],  and  these  same  techniques  were  applied  to  extract  feature 
vectors  from  the  data  considered  here. 

The  data  associated  with  a  given  item  under  test  was  man¬ 
ifested  in  one  of  three  variations:  (i)  only  magnetometer  data 
were  available,  (ii)  only  EMI  data  were  available,  or  (iii)  both 
magnetometer  and  EMI  data  were  available.  These  different 
variations  were  tied  to  the  details  of  the  data  collection,  and  to 
the  quality  of  the  data  acquired  for  each  of  the  two  modalities. 
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For  EMI  sensor  data  alone,  there  are  230  clutter  cases  (non- 
UXOs)  and  44  UXOs.  For  the  magnetometer  sensor  data  alone, 
there  are  719  non-UXOs  and  79  UXOs.  Concerning  the  case 
for  which  data  from  both  the  magnetometer  and  EMI  sensors 
are  available,  there  are  228  Non-UXOs  and  44  UXOs.  For 
the  EMI  data  the  feature  vector  is  of  dimension  10,  for  the 
magnetometer  the  feature  vector  is  of  length  9,  and  when 
both  are  used  the  two  types  of  features  are  concatenated. 
Before  processing  each  feature  is  centered  and  normalized. 
Specifically,  we  compute  the  mean  and  variance  for  each 
dimension  of  the  features;  each  feature  is  shifted  by  subtracting 
its  mean  and  then  divided  by  its  variance.  The  feature  vectors 
from  this  data  set  are  available  to  other  researchers,  upon 
request  to  the  authors. 

B.  Detection  Results  for  Non-Active  Classification:  Transduc- 
tive 

In  semi-supervised  learning,  as  discussed  above,  there  are 
two  frequently  applied  settings.  In  a  transductive  algorithm 
[11]  it  is  assumed  that  all  of  the  labeled  and  unlabeled  data 
are  available  simultaneously,  and  the  algorithm  is  designed 
to  classify  the  unlabeled  data,  employing  the  data-manifold 
information  provided  by  both  the  labeled  and  unlabeled  data. 
Importantly,  if  a  new  unlabeled  example  was  added,  then 
the  whole  transductive  learning  process  would  have  to  begin 
anew.  In  an  inductive  semi-supervised  learning  algorithm  [9] 
one  again  has  both  labeled  and  unlabeled  data  with  which 
an  algorithm  is  designed,  exploiting  the  data  manifold.  Once 
this  algorithm  is  designed,  it  may  be  applied  to  the  existing 
unlabeled  data,  as  well  as  to  new  unlabeled  data,  without 
having  to  redo  the  learning  process. 

In  many  UXO-sensing  settings  all  of  the  data  are  collected 
at  once,  and  therefore  a  transductive  semi-supervised  learning 
algorithm  may  be  sufficient.  However,  if  data  is  collected 
incrementally  on  a  large  UXO-cleanup  site,  the  inductive 
framework  may  be  attractive.  The  semi-supervised  algorithm 
developed  here  is  inductive,  but  clearly  it  may  be  applied  in 
a  transductive  setting  as  a  special  case.  However,  there  are 
existing  semi-supervised  algorithms  of  interest  that  are  only 
transductive,  the  algorithm  of  Szummer  &  Jaakkola  [11]  being 
an  important  example. 

In  this  subsection  we  compare  our  results  with  perfor¬ 
mance  achieved  using  [11].  We  also  make  a  comparison  to 
results  computed  using  a  logistic-regression  classifier,  where 
the  graph  considered  here  was  as  a  prior  to  regularize  the 
learning  process  (imposing  smoothness  of  the  classifier  along 
the  data  manifold  [9]).  Like  our  algorithm,  the  approach 
in  [9]  may  operate  in  an  inductive  setting.  However,  such 
that  the  comparisons  are  fair,  for  all  examples  considered 
in  this  section,  the  unlabeled  data  on  which  classification  is 
performed  is  the  same  unlabeled  data  used  for  semi-supervised 
algorithm  learning  (consistent  with  the  requirements  of  a 
transductive  algorithm).  The  performance  is  evaluated  in  terms 
of  classification  accuracy,  defined  as  the  ratio  of  the  number 
of  correctly  classified  UXOs  and  non-UXOs  over  the  total 
number  of  data  being  used.  For  this,  a  threshold  0.5  is  used 
to  the  classification  probability.  In  the  discussion  that  follows 


the  algorithms  considered  will  be  referred  to  as  follows:  (i)  the 
method  in  [11]  is  denoted  RW-Transductive;  (ii)  the  method 
developed  in  this  paper  is  termed  RW-Inductive;  (iii)  the 
method  in  [9]  is  termed  Logistic-GRF  (for  Gaussian  random 
field  prior);  and  (iv)  the  supervised  solution  is  termed  Logistic- 
Regression,  with  this  equivalent  to  the  algorithm  in  [9]  without 
the  graphical  prior. 

To  ensure  a  fair  comparison,  we  apply  the  same  Markov 
random  walk  graph  A'-*'  with  t  =  50  and  kernel  width  at  = 
l/3minfe=i:jv  xt  —  x^.|  for  both  RW-Transductive  and  RW- 
Inductive.  Since  the  graphical  prior  for  Logistic-GRF  [9]  must 
be  symmetric,  we  symmetrize  the  graph  by  (A^  +  )/2, 

where  A^  represents  the  transpose  of  A(tl .  Denote  by  X  any 
of  the  three  data  sets  and  y  the  associated  label  set. 

For  the  results  in  Figure  1,  we  randomly  sample  A'/,  C 

X  and  assume  the  associated  label  set  3-7.  are  available. 
The  semi-supervised  algorithms  are  trained  using  X  U  U/. 
and  tested  on  X  \  A'/, .  The  supervised  algorithm  is  trained 
on  Xl  U  y i,  and  tested  on  X  \  X^.  From  Figure  1,  we 
observed  that  all  the  semi-supervised  algorithms  outperform 
the  supervised  algorithm.  In  addition,  the  RW-Inductive  con¬ 
sistently  outperforms  the  Logistic-GRF  and  is  comparable  in 
performance  to  RW-Transductive.  Further,  by  comparing  these 
three  subfigures,  we  observe  that  the  magnetometer  data  yields 
the  best  detection  results  on  the  data  considered;  this  issue  will 
be  reconsidered  when  the  labeled  data  are  chosen  actively, 
rather  than  randomly,  as  considered  in  this  example. 

C.  Detection  Results  for  Non-Active  Classification:  Inductive 

In  Figure  2  we  test  the  algorithms  in  an  inductive  mode, 
and  therefore  in  this  case  we  only  compare  RW-Inductive 
with  Logistic-GRF.  For  this  example  we  randomly  sample  200 
exmplars  Xtest  C  X  as  testing  data,  from  all  the  magnetometer 
data.  From  the  remaining  data,  we  randomly  select  a  subset 

XI  of  feature  vectors,  for  which  the  associated  labels  y f  are 
assumed  available,  and  the  remaining  feature  vectors  Xjj  are 
left  as  unlabeled.  The  semi-supervised  algorithms  are  trained 
by  using  XjJ  U  34,  U  Xjj  and  tested  on  Xtest ■  In  Figure  2 
we  observe  that  the  RW-Inductive  semi-supervised  algorithm 
performs  on  average  superior  to  Logistic-GRF,  although  the 
“error  bars”  overlap.  The  length  of  the  error  bar  is  twice  the 
standard  deviation  of  the  detection  accuracy.  Therefore,  if  the 
result  is  Gaussian  distributed,  95%  of  the  values  lie  within  the 
error  bar. 

I).  Detection  Results  with  Active  Label  Acquisition 

In  Figures  3  through  5  we  consider  active  learning  for  the 
three  data  sets  presented  in  Figure  1 ;  we  first  randomly  select 
one  UXO  and  one  non-UXO  feature  vector,  and  the  other 
labeled  data  are  selected  by  the  active  learning  algorithm; 
to  design  the  classifier  we  require  at  least  one  feature  vector 
from  each  class,  but  after  active  learning  proceeds  sufficiently 
the  large  number  of  labeled  examples  determined  adaptively 
typically  dominate  the  two  labeled  examples  with  which  we 
commence.  Compared  to  Figure  1,  we  observe  that  the  active 
learning  results  are  much  better  than  those  performed  with 
non-active  learning  (i.e.,  random  selection  of  the  labeled  data). 
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Fig.  1.  Comparison  of  semi-supervised  and  supervised  algorithms  for  the  case  in  which  all  unlabeled  data  are  available  when  algorithm  learning  is  performed. 
Each  curve  is  an  average  from  25  independent  trials.  The  horizontal  axis  is  the  size  of  X /  .  The  algorithms  are  tested  on  Xjj.  From  left  to  right  in  the  sub-figures, 
the  results  are  for  EMI  data,  magnetometer  data  and  for  both  EMI  and  magnetometer  data. 


Fig.  2.  Semi-supervised  results  on  magnetometer  data  for  the  case  in  which  different  unlabeled  data  are  used  when  testing  than  those  used  while  learning. 
Each  curve  is  an  average  from  50  independent  trials.  The  algorithms  are  tested  on  200  data  in  Xtest-  that  are  randomly  sampled  from  all  magnetometer  data 
X.  The  horizontal  axis  is  the  size  of  X;  ,  which  is  randomly  selected  from  X  \  Xtest-  Error  bars  represent  standard  deviations. 


which  demonstrates  the  effectiveness  of  the  active  learning 
schemes.  Although  all  three  semi-supervised  active-learning 
algorithms  perform  well,  the  RW-lnductive  results  appear  to  be 
best  on  average,  particularly  after  a  relatively  large  number  of 
labeled  examples  are  acquired.  Note  that  for  a  relatively  small 
number  of  excavations  for  label  acquisition,  the  magnetometer 
results  are  superior  but,  encouragingly,  as  the  number  of  labels 
acquired  extends  to  120,  the  fusion  of  the  magnetometer  and 
EMI  data  yields  slightly  improved  performance  relative  to 
either  sensor  alone. 

Further  analyzing  Figure  3,  we  observe  that  with  120  labeled 
magnetometer  examples,  these  requiring  120  evacuations  for 
the  purpose  of  learning,  the  RW-lnductive  algorithm  achieves 
an  accuracy  rate  as  high  as  96%,  but  for  the  non-active 
learning,  the  best  algorithm  performance  for  the  four  meth¬ 
ods  shown  in  Figure  1  is  only  85%.  When  both  EMI  and 
magnetometer  data  are  considered,  from  Figure  5,  we  observe 
that  with  120  labeled  data,  RW-lnductive  achieves  an  accuracy 
of  97.5%,  but  for  the  non-active  learning  results  in  Figure 
1,  RW-Transductive  and  RW-lnductive  achieve  92%  accuracy, 
and  Fogistic-GRF  and  Fogistic-Regression  only  reached  86%. 

The  results  in  Figures  3-5  present  the  final  classification 
accuracy  for  different  numbers  of  labeled  data.  In  practice  one 
will  have  to  employ  the  information-theoretic  measure  in  (25) 
to  decide  when  to  stop  excavating  for  the  purpose  of  learning, 
with  the  acquired  labels  used  to  classify  all  remaining  items. 
In  Figure  6  we  plot  the  mutual  information  of  the  next  item  to 


be  labeled  via  active  learning,  as  a  function  of  the  number  of 
items  labeled.  Results  are  shown  for  each  of  the  ten  different 
cases  considered  to  generate  the  results  in  Figures  3-5.  Note 
that  the  performance  is  relatively  independent  of  which  two 
items  were  considered  to  constitute  the  initial  labeled  UXO 
and  non-UXO  feature  vectors. 

Considering  Figure  6,  in  the  next  set  of  examples  we 
terminate  the  learning  phase  when  the  expected  gain  in  mutual 
information  is  less  than  0.1  (this  is  an  arbitrary  setting,  but 
one  observes  from  Figure  6  that  this  is  a  point  at  which  the 
subsequent  information  gains  are  relatively  small).  In  Table 
I  we  note  that  with  this  threshold  the  algorithm  consistently 
labeled  approximately  90  items,  out  of  the  possible  272  total 
items.  Interestingly,  19  of  the  UXO  excavated  in  this  active¬ 
learning  phase  were  common  among  all  of  the  ten  trials.  We 
also  reiterate  that  the  non-UXO  excavated  in  this  phase  are 
best  not  termed  “false  alarms”,  since  the  algorithm  desires  to 
learn  the  properties  of  both  the  UXO  and  non-UXO. 

Once  the  active  learning  is  completed,  a  final  classifier  is 
designed,  and  this  may  be  applied  to  the  remaining  unlabeled 
data.  In  Figure  7  we  plot  a  typical  receiver  operating  character¬ 
istic  (ROC)  curve,  which  corresponds  to  varying  the  threshold 
on  the  output  of  the  final  classifier.  We  also  place  a  circle  at 
the  point  on  the  ROC  for  which  the  threshold  is  set  to  0.5; 
the  ROC  is  computed  for  all  unlabeled  data  not  excavated  in 
the  active-learning  phase.  As  indicated,  the  results  in  Figure 
7  are  typical  of  all  of  the  ten  trials  considered  above,  but 
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Fig.  3.  Active  learning  results  for  EMI  data.  The  first  two  labeled  data  include  one  UXO  and  one  Non-UXO,  both  are  sampled  randomly.  Each  curve  is  an 
average  from  50  independent  trials.  The  horizontal  axis  is  the  size  of  X],.  The  algorithms  are  tested  on  Xjj.  Error  bars  shown  are  standard  deviations. 


Fig.  4.  Active  learning  results  for  magnetometer  data.  The  first  two  labeled  data  include  one  UXO  and  one  non-UXO,  both  are  sampled  randomly.  Each  curve 
is  an  average  from  50  independent  trials.  The  horizontal  axis  is  the  size  of  X^.  The  algorithms  are  tested  on  Xjj.  Error  bars  shown  are  standard  deviations. 


Fig.  5.  Active  learning  results  as  applied  to  EMI  and  magnetometer  data.  The  first  two  labeled  data  include  one  UXO  and  one  non-UXO,  both  are  sampled 
randomly.  Each  curve  is  an  average  from  50  independent  trials.  The  horizontal  axis  is  the  size  of  X The  algorithms  are  tested  on  Xjj.  Error  bars  shown 
are  standard  deviations. 


only  a  single  ROC  is  presented  for  ease  of  viewing.  Note  that 
the  algorithm  effectively  detects  most  of  the  UXO,  but  the 
performance  saturates  around  a  detection  probability  of  0.9, 
and  this  is  because  two  of  the  UXOs  have  features  that  are 


very  similar  to  the  non-UXO,  these  constituting  challenging 
targets  for  classification.  The  results  in  Figure  7  correspond  to 
Trail  4  in  Table  I. 
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Number  of  Labeled  Data 

Fig.  6.  Active  learning  mutual  information  as  a  function  of  the  next  item  considered  for  labeling,  as  applied  to  the  EMI  and  magnetometer  data.  Ten  different 
curves  are  considered,  for  different  random  selection  of  the  initial  UXO  and  non-UXO  labeled  examples. 


Trial 

t 

2 

3 

4 

5 

6 

7 

8 

9 

10 

Total  #  of 
labeled  data 

89 

91 

93 

95 

90 

89 

89 

95 

91 

93 

UXOs 

24 

24 

23 

26 

25 

24 

24 

25 

24 

23 

non-UXOs 

65 

67 

70 

69 

65 

65 

65 

70 

67 

70 

TABLE  I 

Summary  of  number  of  items  labeled  for  each  of  the  ten  trials  in  Fig.  6,  with  the  mutual-information-gain  threshold  set  at  0. 1 . 
Listed  are  the  number  of  UXO  and  non-UXO  excavated  in  the  active-learning  phase. 


Pfa 


Fig.  7.  Receiver  operating  characteristic  for  Trial  4  in  Table  1.  The  circle  denotes  a  threshold  of  0.5  as  applied  to  the  classifier.  The  labeled  data  were 
acquired  via  active  learning,  and  the  results  here  are  as  applied  to  the  remaining  unlabeled  data. 


V.  Conclusions 

In  this  paper  we  have  considered  the  use  of  semi-supervised 
learning  in  the  context  of  UXO  detection,  based  on  elec¬ 
tromagnetic  induction  (EMI)  and  magnetometer  data.  The 
algorithms  were  applied  to  features  extracted  from  these  data, 
with  the  features  linked  to  EMI  and  magnetometer  dipole- 
based  parametric  models.  Semi-supervised  learning  is  par¬ 
ticularly  well  suited  to  the  UXO-sensing  problem,  because 
one  typically  deploys  a  cart-based  system  to  collect  all  EMI 


and  magnetometer  data  at  once,  for  an  entire  site.  Hence, 
one  may  perform  feature  extraction  simultaneously  on  all 
buried  items  of  interest,  and  the  classification  of  any  one 
feature  vector  may  be  placed  within  the  context  of  all  feature 
vectors.  This  contextual  information  yields  information  on  the 
characteristics  of  the  data  manifold,  which  has  proven  useful 
to  improve  classification  performance  in  many  settings.  By 
contrast,  in  traditional  supervised  learning  the  labeled  data 
alone  are  employed  to  learn  a  classifier,  and  this  classifier  is 
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employed  one-by-one  to  each  unlabeled  example,  in  isolation, 
and  consequently  contextual  information  is  not  employed. 

Semi-supervised  algorithms  typically  impose  the  following 
condition:  two  feature  vectors  that  are  “close”  in  feature  space 
should  be  classified  similarly.  This  implies  that  the  classifier 
outputs  should  vary  smoothly  over  the  high-density  portion  of 
the  data  manifold,  and  consequently  that  the  decision  boundary 
in  feature  space  should  reside  in  areas  of  low  data  density. 
These  concepts  may  only  be  implemented  if  knowledge  of  the 
distribution  of  all  unlabeled  data  is  exploited  when  performing 
algorithm  learning.  The  most  advanced  semi-supervised  algo¬ 
rithms  developed  to  date  are  based  on  graphical  techniques. 
Specifically,  the  nodes  on  the  graph  correspond  to  the  feature 
vectors  (labeled  and  unlabeled),  and  the  edge  between  any 
two  feature  vectors  is  defined  by  a  distance  between  the  two 
in  feature  space,  where  here  this  is  defined  by  a  radial  basis 
function.  There  are  many  different  ways  in  which  the  graph 
may  be  employed  within  a  semi-supervised  algorithm.  For 
example,  one  may  perform  inference  directly  on  the  nodes 
of  the  graph,  thereby  inferring  labels  on  the  unlabeled  nodes. 
This  approach  does  not  generalize  to  the  classification  of 
a  general  (new)  feature  vector  that  is  not  on  the  original 
graph,  and  therefore  if  new  unlabeled  data  are  acquired,  the 
graph  must  be  reconstituted  and  learning  performed  anew. 
This  has  been  referred  to  as  transductive  semi-supervised 
learning.  By  contrast,  one  may  also  use  the  graph  to  learn  an 
“inductive”  semi-supervised  algorithm,  which  may  be  applied 
to  new  unlabeled  data  without  having  to  reconstitute  the  graph 
or  relearn.  In  the  work  presented  here  we  have  developed 
a  new  inductive  semi-supervised  algorithm,  which  extends 
the  transductive  algorithm  developed  in  [11],  We  have  also 
performed  comparisons  to  another  (distinct)  inductive  semi- 
supervised  algorithm  [9],  as  well  as  to  supervised  learning. 
We  have  demonstrated  that  for  the  measured  UXO-sensing 
data  considered  here,  from  an  actual  UXO  cleanup  site,  that 
the  semi-supervised  algorithms  perform  better  than  purely  su¬ 
pervised  learning,  implying  that  there  is  value  in  the  manifold 
information  associated  with  UXO  sensing,  at  least  for  the 
UXO  site  considered. 

In  the  UXO-excavation  problem,  clearly  there  will  be  many 
items  manually  removed,  and  the  cost  of  unnecessary  excava¬ 
tion  of  non-UXO  items  often  constitutes  the  principal  cleanup 
cost.  One  may  therefore  ask  whether  initial  excavation  may  be 
performed  with  the  purpose  of  learning.  This  is  termed  active 
learning,  and  is  characterized  by  asking  in  an  information- 
theoretic  sense  which  unlabeled  feature  vectors  would  be  most 
informative  for  improving  the  classifier  if  the  associated  label 
could  be  acquired  (here  implemented  via  targeted  excavation). 
In  this  sense  the  algorithm  learns  adaptively,  directly  on  the 
site  under  test.  In  the  examples  considered  here  active  learning 
yielded  substantial  improvement  in  UXO-classification  perfor¬ 
mance,  relative  to  selecting  the  labeled  data  randomly.  One 
limitation  of  the  active-learning  framework,  as  implemented, 
is  that  to  commence  one  needs  at  least  one  UXO  and  one 
non-UXO  labeled  example.  In  practice  this  is  often  not  a 
significant  limitation,  because  one  typically  knows  the  type  of 
UXO  that  may  be  encountered  at  a  given  site  (from  historical 
information,  and  also  from  the  items  observed  on  the  surface). 


and  an  archive  of  existing  labeled  UXO  data  may  be  used  for 
this  target  class.  Further,  since  at  a  typical  site  the  quantity  of 
non-UXO  is  much  larger  than  the  number  of  UXO,  almost  any 
initial  excavations  will  yield  at  least  one  non-UXO  signature. 
In  the  results  presented  here  we  examined  the  sensitivity  of  the 
algorithm  to  the  initial  UXO  and  non-UXO  labeled  exemplars, 
and  found  the  algorithm  to  be  robust  in  practice. 

For  the  UXO-sensing  data  considered,  we  observed  a  sub¬ 
stantial  gain  in  the  performance  of  the  semi-supervised  algo¬ 
rithm  developed  here  relative  to  a  corresponding  supervised- 
learning  algorithm.  However,  for  the  semi-supervised  algo¬ 
rithm,  the  performance  of  learning  using  active-learning- 
determined  labeled  data  was  only  slightly  better  to  learning 
with  randomly  selected  labeled  data  (the  latter  still  using  semi- 
supervised  learning).  This  is  a  phenomenon  we  have  observed 
on  several  different  data  sets:  Since  the  semi-supervised  al¬ 
gorithm  exploits  the  information  in  the  entire  data  manifold, 
using  labeled  and  unlabeled  data,  we  have  found  in  practice 
that  it  is  less  sensitive  to  exactly  which  labeled  data  are 
considered;  by  contrast,  when  employing  supervised  learning 
the  particular  labeled  data  considered  is  often  of  significant 
importance  [5]. 

The  most  significant  direction  for  future  research  involves 
appropriate  design  of  the  graph  for  UXO  applications.  The 
weights  on  the  graph  edges  are  adapted  to  the  characteristics  of 
the  manifold,  via  the  data-dependent  variance  in  (1).  However, 
in  the  analysis  that  followed  a  f-step  walk  on  the  graph  was 
employed,  where  in  the  examples  considered  here  t=50.  The 
size  of  t  plays  an  important  role  in  defining  what  it  means 
for  two  feature  vectors  to  be  “close”  in  feature  space.  It 
is  of  interest  to  develop  a  principled  means  of  defining  an 
appropriate  t  for  a  given  data  set.  We  note  that  the  use  of 
1=50  was  not  carefully  tuned  for  the  data  considered  here, 
and  many  similar  values  (20  <  t  <  80)  yielded  similar  results 
on  the  Badlands  UXO  data.  We  also  note  that  the  need  to 
develop  a  technique  for  selecting  t  is  not  unique  to  the  RW- 
Inductive  algorithm  introduced  here,  but  is  of  interest  for  any 
of  the  graph-based  semi-supervised  algorithms. 
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Abstract 


We  address  the  problem  of  unexploded  ordnance  (UXO)  detection  in  which  data  to  be  classified  are 
available  from  multiple  sensor  modalities  and  multiple  resolutions.  Specifically,  features  are  extracted 
from  measured  magnetometer  and  electromagnetic  induction  (EMI)  data;  multiple  resolution  data  is 
manifested  when  the  sensors  are  separated  from  the  buried  targets  of  interest  by  different  distances  (e.g., 
different  sensor-platform  heights).  The  proposed  classification  algorithm  explicitly  emphasizes  features 
extracted  from  fine-resolution  imagery  over  those  extracted  from  less  reliable,  coarse-resolution  data. 
When  fine-resolution  features  are  unavailable  (due  to  undeployed  sensors),  the  algorithm  analytically 
integrates  out  the  missing  features  via  an  estimated  conditional  density  function,  conditioned  on  the 
observed  features  (from  deployed  sensors).  This  density  function  exploits  the  statistical  relationships 
that  exist  among  features  at  different  resolutions,  as  well  as  those  among  features  from  different  sensors 
(in  the  multi-sensor  case).  Experimental  classification  results  are  shown  for  real  UXO  data,  on  which 
the  proposed  algorithm  consistently  achieves  better  classification  performance  than  common  alternative 
approaches. 


I.  Introduction 

The  problem  of  unexploded  ordnance  (UXO)  classification  continues  to  receive  significant 
attention  in  the  scientific  community  [  1  ]— [5] .  The  objective  of  a  UXO  detection  (or  classification) 
task  is  to  distinguish  buried  UXO  targets  from  non-UXO  targets  (i.e.,  clutter).  To  this  end,  sensors 
are  used  to  measure  data  (e.g.,  magnetic  fields)  over  a  two-dimensional  grid.  These  (raw  data) 
sensor  measurements  can  therefore  be  considered  to  be  in  the  form  of  an  image.  The  ultimate 
classification  task  is  then  performed  using  features  extracted  from  this  imagery.  This  paper 
addresses  the  problem  of  multi-sensor  UXO  detection  when  magnetometer  and  electromagnetic 
induction  (EMI)  sensors  are  employed.  In  particular,  we  address  the  realistic  case  in  which 
imagery  from  each  of  these  sensor  modalities  may  be  available  from  multiple  resolutions. 
Throughout  this  paper,  the  term  “resolution”  refers  to  the  amount  of  spatial  detail  attainable 
in  an  image,  with  this  quantity  being  inversely  proportional  to  the  distance  between  the  sensor 
and  the  targets  (or  ground). 

Magnetometer  and  electromagnetic  induction  (EMI)  constitute  the  principal  sensors  used  in 
UXO  detection  and  classification  [6]— [8].  It  is  a  time-consuming  task  to  deploy  these  sensors 
over  the  large  domains  that  must  be  interrogated  for  possible  buried  UXO.  There  has  therefore 
been  interest  in  deploying  these  sensors  on  helicopters,  thereby  accelerating  the  data-collection 
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process.  Although  helicopter-borne  sensors  afford  increased  collection  speed,  they  also  incur  the 
deleterious  effect  of  a  loss  of  signal  strength.  In  particular,  for  a  distance  r  between  the  sensor 
and  target,  magnetometers  and  EMI  sensors  measure  field  strengths  proportional  to  1/r3  and 
1/r6,  respectively.  Therefore,  the  increased  sensor  height  required  by  the  helicopter  manifests  a 
significant  loss  of  signal  strength,  which  undermines  the  ability  to  detect  small  or  deeply-buried 
UXO.  In  practice,  therefore,  one  may  be  interested  in  deploying  a  helicopter-bome  sensor  over 
as  large  a  region  as  possible,  with  ground-based  sensors  applied  only  locally,  over  a  coarse  set 
of  lines  running  through  the  site  under  test. 

This  paper  addresses  the  problem  of  classification  for  data  sets  in  which  the  features  of 
different  data  points  are  extracted  from  sensor  imagery  at  different  resolutions.  Additionally,  this 
work  considers  the  more  general  case  in  which  multiple  sensor  modalities  —  each  of  which  may 
operate  at  multiple  resolutions  —  are  employed.  In  the  multi-sensor  scenario,  incomplete  data 
is  manifested  when  some  data  points  are  interrogated  by  only  a  subset  of  the  available  sensors. 
Incomplete  data  also  exists  in  the  single-sensor  case  when  not  all  data  points  have  features 
extracted  from  imagery  at  all  resolution  levels.  Although  the  classification  algorithm  introduced 
in  this  paper  is  applicable  for  data  sets  that  fit  the  general  multi-resolution  framework,  we  focus 
specifically  on  the  problem  of  UXO  detection.  In  summary,  the  novel  problem  we  address  in 
this  paper  is  of  multi-sensor,  multi-resolution,  incomplete-data  UXO  classification. 

It  is  important  to  emphasize  that  this  paper  addresses  a  multi-resolution  classification  problem 
that  has  not  been  examined  previously  (see  [9]  for  a  thorough  review  of  multi-resolution  work). 
In  most  previous  “multi-resolution”  image  classification  work  [10],  [11],  the  original  imagery 
actually  exists  at  only  a  single  resolution;  the  term  “multi-resolution”  refers  simply  to  a  wavelet 
or  other  multi-resolution  decomposition  [12]  of  the  original  single  resolution  imagery.  In  contrast, 
this  paper  utilizes  multiple  raw  images,  each  at  a  unique  resolution.  The  ultimate  classification 
objective  also  distinguishes  this  work  from  other  multi-resolution  image  classification  work.  Most 
multi-resolution  classification  work  strives  for  pixel-level  classification  via  image  segmentation 
[13]— [15].  In  contrast,  in  this  work,  a  given  image  belongs  to  a  single  class  (UXO  or  non-UXO). 

Several  approaches  can  be  employed  to  handle  the  missing-data  problem  in  which  some 
data  points  are  characterized  by  features  extracted  from  only  a  subset  of  the  possible  sensors 
and/or  resolution  levels.  One  approach  would  build  a  separate  classifier  for  each  type  of  data. 
Assuming  the  set  of  possible  data  is  relatively  small,  this  approach  would  be  reasonable.  The 
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major  drawback  with  this  method,  however,  is  that  the  dependencies  between  different  types 
of  data  are  not  exploited.  In  addition  to  ignoring  the  correlations  between  sensors,  the  severe 
fragmentation  of  the  data  set  —  based  on  the  combinations  of  which  sensors  and  resolutions  are 
observed  —  may  leave  insufficient  data  to  train  each  classifier. 

A  different  method  would  concatenate  the  features  from  the  various  resolutions;  incomplete 
data  arising  from  missing  sensors  and/or  resolutions  would  be  handled  in  some  way,  such  as 
by  imputation  [16].  However,  such  an  approach  would  treat  features  obtained  from  images  at 
different  resolutions  equally.  Intuitively,  one  should  favor  using  features  extracted  from  high- 
resolution  imagery. 

The  algorithm  proposed  in  this  paper  extends  the  work  in  [17]  —  in  which  missing  data  is 
analytically  integrated  out  —  to  the  case  of  multi-resolution  imagery.  The  algorithm,  which  does 
not  suffer  from  any  of  the  drawbacks  that  plague  the  aforementioned  methods,  requires  only  a 
single  classifier,  regardless  of  the  number  of  sensors  or  the  number  of  resolutions  involved  in  the 
problem.  Moreover,  all  data  are  utilized,  so  correlations  among  sensors,  as  well  as  among  features 
at  different  resolutions,  are  exploited.  Additionally,  features  extracted  from  different  resolutions 
are  not  treated  equally;  rather,  fine-resolution  features  are  given  more  importance.  Furthermore, 
the  missing  data  that  exist  are  handled  in  a  principled  manner,  avoiding  explicit  imputation. 
Specifically,  the  missing  data  are  integrated  out  via  the  use  of  an  estimated  conditional  density 
function  that  relates  the  dependencies  of  features  both  of  a  single  given  sensor  at  different 
resolutions,  as  well  as  of  features  from  different  sensors. 

The  remainder  of  this  paper  is  organized  as  follows.  Section  II  explains  notation  necessary 
for  the  proposed  classification  algorithm  introduced  in  Section  III.  Section  IV  describes  the 
UXO  model  inversion  (and  feature  extraction)  processes.  Experimental  classification  results  are 
shown  in  Section  V.  Section  VI  consists  of  a  discussion,  followed  in  Section  VII  by  concluding 
comments  and  directions  for  future  work. 

II.  Notation 

Consider  the  case  in  which  a  sensor  generates  raw  data  in  the  form  of  an  image,  from  which 
features  are  extracted  subsequently.  Assume  we  possess  S  such  sensors,  the  s-th  of  which  can 
operate  at  Rs  +  1  resolutions;  the  resolution  is  a  function  of  the  distance  between  the  sensor  and 
the  ground  under  which  the  targets  are  buried.  Each  of  the  S  sensors  may  or  may  not  be  of  the 
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same  modality,  and  the  possible  resolutions  of  each  sensor  are  in  general  unique.  Define  A*  to 
be  the  r-th  sensor-target  separation  distance  (hereafter,  simply  “separation  distance”)  of  the  s-th 
sensor,  for  s  —  1,  2, . . . ,  S  and  r  —  0, 1,  2, ,  Rs.  Let  Ag  denote  the  smallest  separation  distance 
of  the  s-th  sensor.  The  resolution  of  an  image,  which  is  inversely  proportional  to  the  separation 
distance,  is  written  TZ(-).  The  image  that  results  from  operating  a  sensor  at  its  smallest  separation 
distance  is  referred  to  as  a  fine-resolution  image.  Sensors  operating  at  larger  separation  distances 
generate  coarse-resolution  imagery. 

Assume  that  for  a  given  sensor,  the  type  of  features  extracted  from  the  raw-image  data  are 
fixed,  regardless  of  the  separation  distance  of  the  sensor  that  generated  the  data.  That  is,  for  a 
given  sensor,  the  specific  features  extracted  will  be  identical  for  all  separation  distances,  but  the 
feature  values  will  in  general  be  unique  for  each  separation  distance. 

Let  }  e  MFs  be  the  Fs  features  of  the  s-th  sensor  for  the  i-th  item  (/.<?.,  object,  which  may 
be  UXO  or  non-UXO),  extracted  from  data  corresponding  to  the  highest  resolution  image  of  the 
s-th  sensor.  For  all  larger  separation  distances,  let  zf'r)  e  MFs  be  the  Fs  features  of  the  s-th 
sensor  for  the  i-th  item,  extracted  from  the  image  obtained  with  the  s-th  sensor  operating  at  the 
r-th  separation  distance.  Define  xt  =  xf  \  xf\  . . . ,  xf  ]  to  be  the  concatenated  feature  vectors 
extracted  from  imagery  at  each  sensor’s  respective  smallest  separation  distance.  Similarly,  define 
Zi  =  zf\z\2\ . . . ,  zf]  to  be  the  concatenated  feature  vectors  extracted  from  each  sensor’s 
coarse-resolution  imagery,  where  z^  =  z\s,2\  . . . ,  z[s,Ra'>  .  Hereafter,  we  shall  refer  to 

x,  and  Zi  as  primary  and  auxiliary  features  (or  data),  respectively. 

The  data  can  alternatively  be  partitioned  in  terms  of  its  observed  and  missing  components. 
Let  o f  be  the  set  of  sensors  for  which  the  i-th  item’s  primary  features  are  observed.  Let  mf  be 
the  (complementary)  set  of  sensors  for  which  the  primary  features  are  missing  for  the  i-th  item. 
Similarly,  let  of  be  the  set  of  sensor  and  separation-distance  pairs  for  which  the  auxiliary  features 
for  the  i-th  item  are  observed.  Let  mf  be  the  (complementary)  set  of  sensor  and  separation- 
distance  pairs  for  which  the  auxiliary  features  for  the  i-th  item  are  missing.  To  simplify  notation, 
we  shall  suppress  the  superscripts  when  doing  so  will  not  cause  confusion  (e.g.,  x^ ,  the  primary 
features  (from  all  sensors)  that  are  observed  for  the  i-th  item,  will  be  written  as  x°').  The  primary 
and  auxiliary  data  of  the  i-th  item  can  thus  be  written  as  x ;  =  [ccf*;  cc”1*]  and  Zi  =  [z°*',  z™*], 
respectively. 

Data  for  a  given  item  is  deemed  to  be  complete  if  we  possess  all  primary  features,  for  all 
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sensors,  for  that  data  point  ( i.e .,  m-  =  0).  A  data  point  is  otherwise  deemed  incomplete .  It  should 
be  noted  that  there  exist  two  different  types  of  incomplete  data.  First,  data  would  be  incomplete 
if  some  subset  of  sensors  were  never  deployed  (at  any  separation  distance)  for  the  corresponding 
item  (UXO  or  non-UXO).  Data  could  also  be  incomplete  even  when  all  sensors  were  deployed 
for  the  item;  specifically,  the  data  would  still  be  considered  incomplete  in  this  case  if  the  data 
had  not  been  interrogated  at  the  smallest  separation  distance  of  every  sensor. 


III.  Classification  with  Incomplete  Data 
Assume  we  have  a  set  of  labeled  (incomplete)  data 

VL  =  {*<,  zir Vi ,  €i,  of,  of,  mf ,  mf 


(1) 


where  e  {  —  1, 1}  is  the  label  (indicating  non-UXO  or  UXO,  respectively)  of  the  i-th  item, 
and  6j  G  [0,  0.5)  is  the  corresponding  labeling  error  rate.  The  labeling  error  rate  is  simply  the 
probability  that  a  true  label  was  flipped  (corrupted)  to  the  wrong  label  (e.g.,  {yfue  =  1}  — > 
{yi  =  —1}).  Such  imperfect  labels  can  be  manifested  when  a  human  analyst  performs  the 
labeling  without  excavating  the  buried  object. 

Let  ws  =  w^\  wf\ wiFa^  represent  the  Fs  weights  of  a  classifier  on  the  primary 
features  of  the  s-th  sensor.  Let  w  =  [wh,  w2, . . . ,  ws]  be  the  classifier  weights  on  the  primary 
features  of  each  sensor  (i.e.,  x,).  Note  that  the  number  of  features  from  each  sensor  need  not 
be  identical.  It  must  be  emphasized  that  the  weights  —  and  hence  the  resulting  classifier  —  are 
on  the  features  extracted  from  only  the  fine -resolution  imagery.  We  emphasize  this  caveat  by 
using  different  notation  for  primary  features  extracted  from  the  fine-resolution  imagery  (x,)  and 
auxiliary  features  extracted  from  coarse-resolution  imagery  (z,). 

In  logistic  regression  (with  a  hyperplane  classifier),  the  probability  of  label  yi  given  feature 
vector  xt  is  p{yi\x^w)  =  cr(yiwTxi),  where  a(rj)  =  (1  +  exp(— y))-1  is  the  sigmoid  link 
function  and  w  constitutes  a  classifier.  Accounting  for  imperfections  in  the  labeling  process 
arising  from  a  known  labeling  error  rate  et,  the  probability  of  label  yi  given  x%  and  c,  is  [18] 

p  ( Vi\xi ,  eh  w)  =  Ci  +  (1  -  2 ei)o-(yiWTXi).  (2) 


Note  that  the  standard  case  of  perfect  labels  is  recovered  when  et  =  0.  If  all  features  are 
extracted  from  complete  data,  the  weights  of  the  classifier  can  be  learned  easily  by  maximizing 
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the  likelihood  of  the  data.  Here  we  consider  the  case  in  which  the  data  are  in  general  incomplete 
in  the  sense  described  above. 

Recall  that  the  classifier  is  to  be  designed  for  only  the  primary  data  —  the  features  extracted 
from  the  finest-resolution  imagery.  We  first  partition  x ,  into  its  observed  and  missing  parts, 
Xi  =  [x°i\x™'i],  and  then  apply  the  same  partition  to  w  to  obtain  w  =  [w0i;wmi].  With 
ry  =  wJnxrJl\  (2)  can  be  written  as 

P  ( Vi\x eii  w)  =  €i  +  (  1  -  2ei)o(yi(w^x°ii  +  ip)).  (3) 


If  the  missing  data  x’-1'  is  integrated  out,  the  needed  probability  of  y,  given  all  observed  features 
can  be  written  as 


p(yi\xi%z°7e^w)  =  p(yi\x0ii,ei,w)p(x™i\x°i,z0ii)dxr[li 


(4) 


=  ei  +  (1  -  2 e»)  /  a(yi(w0.x°i  +  rji))p  (rji ZT)  drH-  (5) 


Although  the  classifier  uses  only  the  primary  data,  the  auxiliary  data  is  exploited  when  primary 
data  is  missing,  via  the  conditional  density  function  p  (x™l\  x°*,  z°'  ).  That  is,  when  primary  data 
is  available,  it  is  utilized;  when  primary  data  is  unavailable,  the  auxiliary  data  becomes  relevant 
and  is  exploited. 

The  integration  in  (5)  can  be  performed  analytically  by  making  two  mild  assumptions.  First, 
we  assume  that  p  (x,.  z,;)  is  a  Gaussian  mixture  model  (GMM),  which  can  accurately  model 
many  reasonably  well-behaved  distributions.  This  density  function  describes  the  relationships 
among  the  same  features  obtained  from  different  resolutions  of  a  given  sensor;  it  also  describes 
the  relationships  among  features  from  different  sensor  modalities.  It  then  follows  that 


p(x.u  Zi)  =  p{x^\  x°\  z?)p(z?<  \xf\  X?,  z°P 


(6) 


where  p(x™\  x°\  zf)  is  also  necessarily  a  GMM.  Introducing  the  notation  xT  —  [x°\  z°%  this 
(K- -component)  GMM  is 


K 

fc=i 


^nii 

X ■  * 


X 


Oi 


p°k 


V yimmi 

^ k  ^ k 


^ k 


Oi  Oi 

^ k 


(7) 


where  tt/,  are  the  non-negative  mixing  proportions  that  sum  to  unity.  Moreover,  p{xr^,\x°i)  is  a 
GMM  as  well.  Because  of  the  linear  relation  ry  =  wj^x™"'- ,  p  (ry  \ x7 )  is  also  a  GMM, 
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Fig.  1.  Illustration  of  the  accuracy  of  the  approximation  made  between  the  logistic  function  and  the  (scaled)  probit  function. 


with  the  parameters 


(9) 

(10) 

_  ^ minii  _  ^rriiOi  r-^0i0i\—l-^0imi 

k  k  V  k  )  k 

(11) 

where  G(Vi)  —  (27t)-1/2  exp  {— r/f/2}  is  the  standard  univariate  Gaussian  density  function  with 
zero  mean  and  unit  variance.  The  requisite  GMM  density  function  estimation  can  be  accurately 
performed  using  all  available  data,  via  the  Variational  Bayesian  Expectation-Maximization  algo¬ 
rithm  presented  in  [17]. 

The  second  (very  accurate)  assumption  is  that  the  sigmoid  function  can  be  approximated  as 
a  probit  function  (i.e.,  a  Gaussian  cumulative  distribution  function) 


cr(a) 


g\j3\du 


(12) 


where  (3  =  -%.  The  accuracy  of  this  approximation  is  shown  in  Figure  1. 

Mirroring  the  derivation  in  [17],  it  can  then  be  shown  that  the  integral  in  (5)  can  be  computed 
analytically.  The  result  of  this  integration  is  that  the  probability  of  y \  given  only  the  observed 
portions  of  Xi  and  z,  can  be  expressed  as  a  mixture  of  sigmoids: 
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The  log-likelihood  function  of  the  incomplete  data  in  (1)  is  then 


£(m)  =  logp 

'viPlvltf 


Nl 


J^iog 
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(14) 

(15) 


The  objective  function  (15)  to  be  maximized  is  no  longer  concave  for  two  reasons.  First,  the 
concavity  is  destroyed  by  the  imperfect  labels  resulting  from  e*.  Even  in  the  case  of  perfect 
labels  though,  (15)  is  not  concave  because  of  the  particular  form  of  the  argument  of  the  sigmoid 
function,  arising  from  the  incomplete  data.  Since  (15)  is  not  concave,  an  intelligent  initialization 
of  w  is  valuable  for  avoiding  local  maxima.  We  therefore  initialize  w  as  follows.  We  “complete” 
the  data  set  by  replacing  the  missing  features  x with  the  conditional  mean  Ex'"  x"1]  = 
Ylk= i  W  ’  where  Slk  and  are  defined  in  (9)  and  (10),  respectively.  For  the  initialization,  we 
also  treat  all  labels  as  perfect,  artificially  setting  all  e*  =  0.  This  “completed,”  “perfectly”  labeled 
data  set  is  submitted  to  the  standard  logistic  regression  [19]  to  obtain  w0,  which  is  then  used 
as  the  initialization  of  w  in  maximizing  (15)  by  a  modified  form  of  gradient  ascent  (additional 
details  are  shown  in  the  Appendix).  Empirical  evidence  [20]  suggests  that  this  initialization 
successfully  avoids  most  local  maxima. 

Thus,  the  maximum-likelihood  (ML)  logistic  regression  classifier  w  can  then  be  obtained,  in 
spite  of  the  missing  data  (and  imperfect  labels).  Thereafter,  the  class  predictions  of  an  unlabeled 
testing  data  point  with  incomplete  (missing)  features  is  computed  trivially  using  (13)  (with  =  0 
since  no  actual  labeling  will  have  transpired). 


IV.  Multi-Sensor  Multiresolution  UXO  Data 

The  proposed  classification  algorithm  is  designed  for  data  sets  consisting  of  imagery  that 
exists  at  multiple  resolutions  for  multiple  sensors,  which  is  a  realistic  scenario  for  UXO  detection 
tasks.  Here  we  consider  the  case  in  which  we  possess  two  modalities,  magnetometer  and  EMI. 
Moreover,  it  is  assumed  that  each  sensor  can  operate  at  two  different  image  resolution  levels. 
The  image  resolution  is  a  function  of  the  distance  between  the  sensor  and  the  (buried)  target. 
Therefore,  these  unique  image  resolutions  are  manifested  by  deploying  a  given  sensor  on  different 
platforms  (at  different  heights). 

Coarse-resolution  imagery  is  generated  by  a  given  sensor  when  a  relatively  large  distance 
separates  the  sensor  from  the  targets;  here,  this  situation  corresponds  to  the  case  in  which  the 
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sensor  collects  data  while  located  on  a  low-altitude  airborne  platform  (e.g.,  a  helicopter)  that 
flies  above  the  area  of  interest.  In  contrast,  fine-resolution  imagery  is  generated  by  a  given  sensor 
when  a  smaller  distance  separates  the  sensor  from  the  targets;  here,  this  situation  corresponds 
to  the  case  in  which  the  sensor  collects  data  while  located  on  a  ground-based  platform.  Because 
the  magnetometer  and  EMI  sensor  may  not  be  located  on  the  same  platform,  each  sensor  may 
interrogate  unique  areas  of  land  that  overlap  only  partially.  As  a  result,  some  targets  may  be 
characterized  by  imagery  from  only  one  of  the  sensors,  which  is  a  case  of  incomplete  data. 

A.  Feature  Extraction  Models 

The  magnetometer  and  EMI  sensor  data  used  in  this  study  are  magnetic  field  measurements 
as  a  function  of  spatial  position.  The  data  from  each  sensor  can  therefore  be  considered  to  be 
in  the  format  of  an  image.  Features  are  extracted  from  this  imagery  and  then  subsequently  used 
in  the  classification  stage  discussed  in  Section  III.  The  features  we  use  here  are  the  parameters 
of  UXO  models  developed  in  [8]  that  are  fit  via  a  model  inversion  process.  Specifically,  the 
measured  (image)  data  is  the  input  to  the  inversion,  and  the  model  parameters  (features)  are  the 
output  of  the  inversion.  We  subsequently  employ  these  fitted  model  parameters  as  the  features  of 
the  classification  stage.  As  a  result,  regardless  of  the  resolution  of  the  image,  the  same  features 
(parameters)  are  extracted.  Although  the  features  are  identical,  the  actual  values  of  these  features 
extracted  from  images  at  different  resolutions  (for  any  given  data  point)  will  be  unique. 

1 )  Magnetometer  Model:  Ferrous  objects  cause  changes  in  the  observed  background  magnetic 
field  of  the  earth;  magnetometers  sense  these  changes.  It  has  been  shown  that  the  spatially 
dependent  magnetometer  signal  is  well-modeled  by  a  simple  magnetic  dipole  [21].  The  success 
of  this  magnetic-dipole  model  for  sensing  buried  UXO  [6],  [7],  [22],  [23]  motivates  us  to  employ 
the  model  for  the  measured  spatially  dependent  magnetometer  data  here. 

In  the  x,  y,  and  2  coordinate  system  of  the  sensor,  let  the  ^-direction  be  normal  to  the  air-soil 
interface.  Let  the  position  vectors  of  the  sensor  (i.e.,  observation  point)  and  target-dipole  be 
r.s  =  xs  ys  zs  and  rt  =  x,  y,  zt  ,  respectively.  Define  rts  =  rs  —  rt  to  be  the  vector 
directed  from  the  dipole  to  the  sensor,  with  r  =  the  corresponding  unit  vector.  It  should  also 
be  noted  that  the  orientation  of  the  magnetic -dipole  —  completely  summarized  by  the  angles  6 
and  <f>  —  is  different  from  the  direction  of  the  ordnance  itself. 
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When  the  sensor  is  sufficiently  distant  from  the  buried  target  relative  to  the  target  dimensions, 
the  (vector)  magnetic  field  may  be  represented  approximately  as  [8],  [24] 

1  m  ■  r 


H 


(16) 


2vr  |rts|3  ’ 

where  m  is  the  magnetic-dipole  moment.  The  magnetometer  employed  to  collect  the  data  used 
in  this  work  measures  the  ^-component  of  the  magnetic  field  as  a  function  of  position  on  the 
surface.  This  measurement  is  subsequently  fit  to  the  model  in  (16)  via  a  simple  gradient  search. 
Specifically,  the  parameters  that  the  model  inversion  fits  are  the  target  position  ( x ,  y,  and  depth 
z),  the  magnetic-dipole  strength,  ( m  =  |m|),  and  the  magnetic -dipole  orientation  {6  and  0).  We 
retain  the  last  four  parameters  (z,  m,  6,  and  0)  of  the  model  as  features  for  the  classification 
stage. 

2 )  EMI  Model:  A  model  for  the  EMI  response  of  targets  that  generalizes  the  magnetometer 
model  via  a  frequency-dependent  magnetic  dipole  has  been  developed  in  [8].  Specifically,  the 
magnetic-dipole  moment  m  of  a  target  is  represented  as 


m  =  MH1 


(17) 


where  Hinc  denotes  the  incident  (excitation)  magnetic  field,  and  M  is  the  magnetization  tensor 
that  relates  the  magnetic  field  to  the  magnetic -dipole  moment.  For  a  UXO  assumed  to  be 
rotationally  symmetric  with  the  axis  of  rotation  along  the  z  direction,  the  (frequency-dependent) 
magnetization  tensor  can  be  expressed  as  a  diagonal  matrix  [22] 
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diag 


mP0  +  E* 


LxJTflpi 
u-jujpi  ’ 


mP0  +  Ei 


LOTTlpi 

u-jupi  i 


mz0  +  E, 
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(18) 


The  terms  rnz0  and  mp0  correspond  to  the  zero-frequency  magnetic-dipole  moments  of  the  target, 
directed  perpendicular  to  and  along  the  target’s  axis  of  rotation,  respectively.  The  terms  rrizj.  and 
rripi  in  (18)  account  for  the  frequency- dependent  character  of  the  response,  while  ojzk  and  upi 
correspond  to  EMI  resonant  frequencies.  Because  higher  order  dipole  moments  in  the  summations 
in  (18)  typically  lack  significant  strength  [25],  here  we  use  only  the  first  term  in  each  summation, 
which  is  representative  of  the  principal  dipole  mode  along  each  of  the  principal  axes. 

If  it  is  assumed  that  the  EMI  source  responsible  for  the  excitation  magnetic  field  Hmc  can  be 
represented  —  as  seen  from  the  target  —  as  a  magnetic  dipole  with  moment  ms,  then  [8] 


Hmc 


1  ms  ■  r 
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where  rst  is  the  vector  directed  from  the  source  to  the  target  center,  with  r  =  the 
corresponding  unit  vector.  Assuming  sufficient  proximity  of  the  sensor’s  source  and  receiver 
coils,  the  total  (frequency-dependent)  magnetic  field  observed  at  the  sensor  will  be  [8] 

Hree  oc  — ^-rTUrMUr,  (20) 

\rst  |6 

where  the  proportionality  constant  depends  on  the  strength  of  the  dipole  source  ms  and  the 
characteristics  of  the  receiver. 

The  3  x  3  unitary  rotation  matrix  U  rotates  the  fields  from  the  coordinate  system  of  the  sensor 
to  the  coordinate  system  of  the  target,  and  U7  transforms  the  dipole  fields  of  the  target  (in  the  M 
coordinate  system)  back  to  the  coordinate  system  of  the  sensor.  Explicitly,  the  target  orientation, 
in  terms  of  the  angles  of  the  target  6  and  0  with  respect  to  the  sensor  coordinate  system,  is 
accounted  for  by 

cos  0  0  sin  0  cos  6  sin  0  0 

U=  0  10  —  sin#  cos0  0  •  (21) 

—  sin  0  0  cos  0  0  0  1 

As  with  the  magnetometer,  the  EMI  sensor  employed  in  this  work  measures  the  ^-component 
of  the  magnetic  field  as  a  function  of  position  on  the  surface.  This  measurement  is  subsequently 
fit  to  the  model  in  (20)  via  a  form  of  the  Levenberg-Marquardt  method  [26].  Specifically,  the 
parameters  that  the  model  inversion  fits  are  the  target  position  (x,  y,  and  depth  z),  the  target 
orientation  ( 6  and  0),  the  magnetic  dipole  strengths  (■ mz0 ,  mp0,  mzk,  and  rnpi),  and  the  EMI 
resonant  frequencies  (c uzk  and  upi).  We  retain  five  parameters  (z,  6,  0,  mz0,  and  mp0)  of  the 
model  as  features  for  the  classification  stage. 

In  fitting  the  more  sophisticated  EMI  model,  parameters  from  the  magnetometer  inversion  are 
used  to  constrain  the  search  of  some  of  the  EMI  model  parameters.  Specifically,  the  depth  (z) 
and  cross-sectional  position  (x  and  y)  of  the  target  found  by  the  magnetometer  inversion  are 
used  to  initialize  the  target  location  in  the  EMI  inversion.  This  initialization  helps  avoid  some 
local  maxima  in  the  inversion  process.  To  overcome  other  local  maxima,  several  (model-fitting) 
solutions  are  obtained,  with  each  solution  resulting  from  randomly  initializing  the  remaining 
parameters  of  the  model.  The  final  parameters  of  the  model  are  taken  to  be  those  of  the  solution 
that  minimizes  the  mean-square  error  between  the  measured  and  model-fit  data. 
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B.  Simulation  of  Multiresolution  Imagery 

We  possess  measured  magnetometer  and  EMI  (image)  data  measured  by  ground-based  sensors; 
we  simulate  multi-resolution  imagery  from  the  available  single-resolution  imagery  in  the  follow¬ 
ing  manner.  The  image-simulation  process  for  the  two  sensors  is  identical,  so  here  we  explain 
the  process  in  terms  of  the  magnetometer  sensor.  During  the  explanation,  we  shall  reference 
Figure  2,  which  illustrates  the  various  stages  of  the  process  for  one  example  target. 

We  begin  with  ground-based  magnetometer  data  measurements  for  a  target  (Figure  2(A)). 
The  magnetometer  model  inversion  explained  in  Section  IV  is  performed,  which  provides  model 
parameters.  These  model  parameters  are  then  assumed  to  be  the  true  model  parameters  of  the 
target.  Using  these  model  parameters,  ground-based  data  can  be  synthesized  (Figure  2(B)).  If 
the  model  fitting  was  successful,  the  measured  and  synthesized  data  should  be  nearly  identical. 
Using  these  same  model  parameters,  one  can  instead  synthesize  coarse-resolution  (helicopter- 
based)  data  (Figure  2(C))  by  increasing  the  value  representing  the  distance  between  the  sensor 
and  the  ground.  The  distance  from  the  ground-based  sensors  to  the  ground  surface  is  0.3  m, 
while  the  distance  from  the  helicopter-based  sensors  to  the  ground  surface  is  assumed  to  be 
5.0  m. 

As  stated  until  now,  this  synthesis  procedure  would  be  unrealistic  because  the  sensor  noise 
of  the  helicopter-based  sensor  should  be  higher  than  that  of  the  ground-based  sensor.  To  reflect 
this  fact,  white  Gaussian  noise  (J\f( 0,  a2),  where  here  a  =  1)  —  representing  sensor  noise  —  is 
added  to  this  synthesized  data  to  produce  noisy  data  (Figure  2(D)).  This  noisy  helicopter-based 
sensor  data  is  then  taken  to  be  the  “raw”  coarse-resolution  sensor  measurements,  analogous  to 
the  raw  ground-based  sensor  data  from  which  the  original  model  parameters  were  obtained. 
This  noisy  data  is  subsequently  used  to  obtain  coarse-resolution  features  via  the  magnetometer 
model  inversion.  It  is  important  to  reiterate  that  the  particular  features  (but  not  the  feature 
values)  extracted  from  any  image  from  a  given  sensor  will  be  identical,  regardless  of  the  image’s 
resolution. 

It  should  be  noted  that  the  amplitude  of  the  response  in  Figure  2(A)  and  2(B)  is  much  larger 
than  that  in  Figure  2(C)  and  2(D)  because  the  response  is  proportional  to  1  jr\s  where  rts  is  the 
distance  between  the  target  and  the  sensor  (see  (16)).  Also  note  that  the  (physical)  area  shown 
in  Figures  2(A)  and  2(B)  is  3m  x  3m,  while  the  area  in  Figures  2(C)  and  2(D)  is  8m  x  8m; 
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Fig.  2.  Process  of  simulating  coarse-resolution  (helicopter-based)  magnetometer  data.  (A)  Measured  fine-resolution  (ground- 
based)  magnetometer  data.  (B)  Synthesized  fine-resolution  (ground-based)  magnetometer  data  using  the  model  parameters 
obtained  from  the  model  inversion  with  the  data  in  (A).  (C)  Synthesized  noise-free  coarse-resolution  (helicopter-based) 
magnetometer  data  using  the  model  parameters  obtained  from  the  model  inversion  and  the  data  in  (A).  (D)  Same  as  (C) 
except  with  (sensor)  noise  added. 


this  larger  area  must  be  considered  in  order  to  ensure  that  the  full  response  is  captured,  for  the 
response  expands  spatially  as  the  sensor-target  distance  increases. 

V.  Experimental  Results 

To  evaluate  the  proposed  incomplete-data  classification  algorithm,  we  applied  it  to  a  UXO  data 
set  consisting  of  166  items,  41  of  which  are  UXO.  This  data  set  was  collected  by  the  Multi-sensor 
Towed  Array  Detection  System  (MTADS)  [27].  This  system  is  composed  of  arrays  of  full-field 
cesium  vapor  magnetometers  and  time-domain  electromagnetic  pulsed  induction  sensors.  The 
magnetometers  were  Geometries  Model  822ROV,  while  the  EMI  sensors  were  highly-modified 
Geonics  EM-61  sensors.  The  data  was  collected  at  a  bombing  target  on  the  Badlands  Bombing 
Range  on  the  Ogala  Sioux  Reservation  in  Pine  Ridge,  South  Dakota.  The  UXO  items  present 
at  the  site  included  M  38  (100  lb.)  sand-filled  practice  bombs,  M  57  (250  lb.)  practice  bombs, 
2.25  in.  and  2.75  in.  rocket  bodies  and  rocket  warheads,  and  ordnance  scrap  (such  as  tail  fins 
and  casing  parts). 
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For  every  item,  we  possess  both  (measured)  fine-resolution  and  (synthesized)  coarse-resolution 
features  from  each  of  the  two  sensors  (a  magnetometer  and  an  EMI  sensor).  As  mentioned  earlier, 
four  magnetometer  features  are  used  to  characterize  the  magnetometer  data  at  each  resolution 
level,  while  five  EMI  features  are  used  to  characterize  the  EMI  data  at  each  resolution  level. 

In  all  experiments,  it  is  assumed  that  coarse-resolution  (helicopter-based  sensor)  data  is  avail¬ 
able  from  both  sensors  for  all  data  points.  This  choice  is  motivated  by  the  fact  that  it  would 
be  relatively  quick,  easy,  and  inexpensive  to  acquire  such  data  vis-a-vis  ground-based  sensor 
data.  In  contrast,  it  is  assumed  that  fine-resolution  sensor  data  will  be  missing  for  some  data 
points,  with  these  amounts  made  specific  later.  It  should  be  noted,  however,  that  the  proposed 
algorithm  can  function  successfully  even  when  data  points  are  missing  data  from  a  given  sensor 
completely  {i.e.,  at  all  resolutions). 

Because  data  is  available  from  two  different  sensors,  many  different  combinations  of  missing 
data  are  possible.  To  conduct  an  extensive  investigation  of  the  proposed  algorithm,  36  different 
combinations  of  missing  data  are  considered  (explained  in  more  detail  below).  The  binary 
Cartesian  product  of  a  set  S  is  the  set  of  ordered  pairs 

SxS=  {(a,P)\ae  S  and  (3  e  S}.  (22) 

In  (22),  let  a  and  f3  be  the  fraction  of  data  points  that  are  missing  fine-resolution  magnetometer 
features  and  fine-resolution  EMI  features,  respectively.  We  conduct  experiments  using  the  ele¬ 
ments  of  the  binary  Cartesian  product  of  the  set  S  =  {0,  0.1,  0.25,  0.5,  0.75,  0.9}  as  the  pairs  of 
amounts  of  missing  primary  (fine-resolution)  data.  For  each  of  the  36  combinations  considered, 
100  independent  trials  are  run.  Each  trial  has  a  random  partition  of  the  data  set  into  training  and 
testing  data,  and  randomly  selected  data  points  that  are  assumed  to  be  missing  the  primary  data. 
Note  that  primary  data  will  be  missing  for  both  training  and  testing  data. 

This  experimental  set-up  was  employed  for  three  different  amounts  of  training  data:  when 
25%,  50%,  and  75%  of  the  data  was  labeled  training  data.  All  classification  results  shown  are 
for  the  remaining  unlabeled  testing  data.  In  all  experiments,  it  was  assumed  that  there  was  no 
labeling  error  (e  =  0).  In  each  experiment,  four  algorithms  are  applied,  each  of  which  handles 
the  multi-resolution  data  in  a  different  manner.  However,  a  logistic  regression  classifier  is  used 
in  all  four  methods,  which  are  explained  below. 

The  proposed  approach  builds  a  classifier  for  only  the  primary  data;  it  handles  missing  primary 
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data  by  integrating  out  the  missing  data,  using  the  estimated  density  function  relating  both  the 
primary  and  auxiliary  data.  This  density  function  —  a  GMM  —  is  accurately  estimated  using 
all  available  data,  via  the  Variational  Bayesian  Expectation-Maximization  algorithm  presented 
in  [17].  Because  class  labels  are  not  used  in  the  estimation,  both  labeled  and  unlabeled  data  can 
be  utilized.  This  fact  ensures  that  the  density  function  can  be  accurately  estimated  even  when 
limited  (labeled)  training  data  is  available. 

The  second  method  builds  a  separate  classifier  for  data  from  each  resolution.  Building  separate 
classifiers  for  data  from  each  resolution  in  the  case  of  a  single  sensor  with  two  resolutions  would 
entail  that  one  classifier  be  built  for  features  extracted  from  fine-resolution  imagery,  and  a  second 
classifier  be  built  for  features  extracted  from  coarse-resolution  imagery.  The  generalization  of 
this  case  to  multiple  sensors  with  multiple  resolutions  is  employed  here  as  the  second  method. 
Specifically,  four  separate  classifiers  are  constructed,  one  to  handle  each  sensor-resolution  pair 
combination.  A  more  detailed  explanation  of  this  method  is  provided  in  the  Appendix. 

The  third  method  builds  a  classifier  for  the  concatenated  primary  and  auxiliary  data;  it  handles 
missing  primary  data  by  integrating  out  the  missing  data,  via  the  approach  used  in  [17].  The 
difference  between  this  method  and  the  proposed  method  is  that  this  method  builds  a  classifier 
for  both  auxiliary  and  primary  data,  whereas  the  proposed  method  does  so  only  on  the  latter.  The 
fourth  method  also  builds  a  classifier  for  the  concatenated  primary  and  auxiliary  data;  however, 
this  method  imputes  (i.e.,  “fills  in”)  missing  primary  data  with  the  unconditional  mean  of  the 
observed  data. 

The  area  under  a  receiver  operating  characteristic  (ROC)  curve  (AUC)  is  given  by  the  Wilcoxon 
statistic  [28] 

1  M  N 

AUC  = - VVl,  (23) 

MN  j  <£—1  m>2M  v  ' 

m= 1  n=  1 

where  Xi,...,xM  are  the  classifier  outputs  of  data  belonging  to  class  1,  yi,...,yN  are  the 
classifier  outputs  of  data  belonging  to  class  -1,  and  1  is  an  indicator  function.  As  a  measure  of 
classification  performance,  the  AUC  is  a  more  useful  quantity  than  accuracy  (i.e.,  the  fraction 
of  classifications  that  are  correct)  when  significant  class  imbalance  exists,  as  it  does  in  this  data 
set.  Moreover,  the  AUC  can  summarize  performance  more  compactly  than  an  ROC  curve.  For 
these  reasons,  we  present  the  results  of  the  classification  experiments  in  terms  of  the  AUC. 

The  results  of  all  of  the  experiments  are  compactly  summarized  in  Figures  3,  4,  and  5.  The 
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Fraction  of  Data  Points  Missing  High  Resolution  Magnetometer  Data 


Fig.  3.  Experimental  results  in  terms  of  AUC  when  25%  of  the  data  set  is  (labeled)  training  data.  (A)  The  proposed  method; 
(B)  four  separate  classifiers  are  built,  one  for  each  possible  combination  of  missing  data;  (C)  one  classifier  is  built  on  all 
features,  with  missing  data  integrated  out  analytically;  and  (D)  one  classifier  is  built  on  all  features,  with  missing  data  handled 
via  unconditional  mean  imputation. 


results  are  displayed  in  these  figures  as  images,  interpolated  from  the  results  of  the  finite  set 
of  36  pairs  of  missing-data  conditions  explained  previously.  Specifically,  the  images  display  the 
AUC  values  as  a  function  of  the  amounts  of  missing  fine-resolution  magnetometer  and  missing 
fine-resolution  EMI  sensor  data.  The  results  from  which  the  resulting  images  were  interpolated 
were  the  mean  AUC  values  over  the  100  independent  trials  of  the  36  pairs  of  conditions.  The 
color  scales  are  identical  in  the  four  panels  within  each  figure,  so  visual  comparisons  among  the 
methods’  results  can  be  made  easily. 

As  can  be  seen  from  the  three  figures,  the  proposed  method  consistently  performs  better  than 
the  other  three  competing  methods,  regardless  of  the  amounts  of  missing  high-resolution  data. 

VI.  Discussion 

It  should  be  emphasized  that  in  the  proposed  method,  the  classifier  weights  are  on  the  pri¬ 
mary  features,  which  are  extracted  from  fine-resolution  imagery.  However,  the  auxiliary  features 
extracted  from  coarse-resolution  imagery  are  still  utilized  in  the  algorithm  when  primary  data  is 
missing.  Specifically,  missing  primary  data  is  analytically  integrated  out  via  the  estimated  density 
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Fig.  4.  Experimental  results 
Figure  3  for  additional  details. 


Fig.  5.  Experimental  results 
Figure  3  for  additional  details. 
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in  terms  of  AUC  when  50%  of  the  data  set  is  (labeled)  training  data.  Refer  to  the  caption  of 
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function,  which  models  the  relationship  between  the  features  from  coarse-resolution  imagery  and 
those  same  features  from  fine-resolution  imagery.  The  experimental  results  consistently  show 
that  the  proposed  method  outperforms  the  alternative  methods.  Here  we  explain  the  reasons 
underlying  this  result. 

Fine-resolution  imagery  contains  salient  aspects  that  are  absent  in  coarse-resolution  imagery. 
Therefore,  features  extracted  from  fine-resolution  imagery  should  be  preferred  to  features  ex¬ 
tracted  from  coarse-resolution  imagery.  Our  proposed  approach  emphasizes  the  importance  of 
the  finer-resolution  data  by  building  a  classifier  with  only  that  data.  The  coarser-resolution  data 
is  still  exploited  (via  the  estimated  density  function),  albeit  in  an  auxiliary  role. 

If  a  classifier  is  instead  built  on  the  conglomerated  features  extracted  from  different  resolutions 
—  as  it  was  in  two  of  the  alternative  methods  —  the  information  about  the  relative  “quality” 
of  the  features  is  ignored.  Concatenating  features  extracted  from  different  resolution  imagery 
also  causes  the  feature-dimension  to  grow  quickly,  which  can  in  turn  lead  to  overfitting  of 
the  training  data.  In  theory,  a  prior  could  be  incorporated  to  combat  overfitting,  but  additional 
complications  arise  as  a  result  of  having  incomplete  data.  In  contrast,  the  proposed  method  has 
no  such  overfitting  issues. 

The  proposed  method  also  consistently  outperforms  the  method  that  builds  a  separate  classifier 
for  data  from  each  sensor-resolution  combination.  This  result  is  possible  because  the  proposed 
method  utilizes  side  information  in  the  form  of  the  estimated  density  function.  By  exploiting 
the  statistical  relationship  that  exists  among  features  at  different  resolutions  (as  well  as  among 
features  from  different  sensors),  better  performance  can  be  achieved.  This  result  can  perhaps  best 
be  understood  from  the  viewpoint  of  super-resolution  techniques.  Knowledge  about  a  problem 
{e.g.,  that  noise  in  an  image  is  Gaussian)  can  be  exploited  to  resolve  a  super-resolution  image  from 
several  blurry  images  [29].  Similarly,  in  this  problem,  knowledge  of  the  statistical  relationship 
between  features  at  different  resolutions  can  be  exploited.  Importantly,  the  proposed  approach 
avoids  the  unnecessary  intermediate  step  of  forming  an  entire  super-resolution  image;  instead, 
the  ultimate  goal  is  addressed  directly:  obtaining  the  equivalent  of  “super-resolution  features.” 

VII.  Conclusion 

Acquiring  fine-resolution  imagery  for  all  data  points  may  be  prohibitively  expensive.  For 
example,  in  the  UXO  detection  problem,  deploying  ground-based  sensors  is  dangerous  and 
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time-consuming.  This  work  presents  a  principled  algorithm  to  classify  imagery  that  is  available 
at  multiple  resolutions.  Because  some  data  points  may  possess  imagery  at  only  a  subset  of 
resolutions,  the  problem  can  be  viewed  as  one  of  incomplete-data  classification.  The  proposed 
algorithm  also  naturally  handles  the  case  in  which  multiple  sensor  modalities  —  each  of  which 
may  operate  at  multiple  resolutions  —  are  used  to  acquire  data.  In  summary,  the  novel  problem 
we  addressed  was  of  multi-sensor,  multi-resolution,  incomplete-data  classification.  Experimental 
results  on  a  challenging  UXO  classification  task  employing  magnetometer  and  EMI  sensors 
demonstrated  the  advantage  of  the  proposed  algorithm  over  common  alternatives. 

Future  work  will  focus  on  the  development  of  an  active  data  acquisition  algorithm  that 
determines  which  data  points  should  receive  finer-resolution  imagery  —  and  at  which  particular 
resolution  level  —  in  order  to  most  improve  performance.  This  active  sensing  concept  is  relevant 
for  many  applications,  including  medical  imaging,  remote  sensing,  and  video  tracking. 
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Appendix 


A.  Modified  Gradient  Ascent 

The  classifier  w  from  Section  III  is  learned  via  a  modified  form  of  gradient  ascent, 
method  uses  the  gradient  and  Hessian  of  the  log-likelihood,  which  we  provide  explicitly 
For  convenience,  we  first  rewrite  the  log-likelihood  function  (15)  as 
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5.  Classification  Method  2 

Here  we  explain  in  greater  detail  the  second  classification  method  used  in  the  experiments. 
Define  sensor  1  to  be  the  magnetometer,  and  define  sensor  2  to  be  the  EMI  sensor.  Recall  that 
data  from  two  image  resolutions  are  available  for  each  of  the  two  sensors.  Let  7J  be  the  set  of 
data  points  for  which  primary  data  from  the  s-th  sensor  is  possessed;  let  7 f  be  the  set  of  data 
points  for  which  auxiliary  data  from  the  .s-th  sensor  is  possessed.  Context  will  elucidate  whether 
the  sets  contain  training  or  testing  data  points.  Note  that  7^  C  7*  in  all  experiments  in  this  paper 
because  it  is  assumed  that  auxiliary  data  is  available  for  all  data  points.  Let  7 SZ\YX  denote  the 
set  of  data  points  in  7 1  but  not  in  yfi  Table  I  compactly  summarizes  the  manner  in  which  the 
various  classifiers  of  this  method  are  constructed  and  utilized. 


table  1 

Explanation  of  classification  method  2  of  the  experiments 


Classifier 

Features  on  Which 

The  Classifier 

Is  Built 

Training  Data 

Points  Used  to 

Train  Classifier 

Testing  Data 

Points  Evaluated 

By  Classifier 

1 

a32)] 

7 

Tiny2 

2 

[*(!),  z(2)] 

7 1  n  (7i\7x) 

3 

[zW,xM] 

(7z\7x)  n  7x 

4 

zW] 

Tin  fit 

(7i\7i)n(7z2\7x2) 

Lor  example,  a  training  data  point  that  has  both  fine-resolution  and  coarse-resolution  magne¬ 
tometer  data  and  both  fine-resolution  and  coarse-resolution  EMI  sensor  data  would  be  used  in  the 
construction  of  all  four  classifiers.  A  testing  data  point  that  has  both  fine-resolution  and  coarse- 
resolution  magnetometer  data,  but  only  coarse-resolution  EMI  sensor  data  would  be  evaluated 
( i.e classified)  using  classifier  2. 


23 


To  summarize,  in  the  training  stage,  all  data  points  that  possess  the  requisite  features  are  used 
to  train  the  classifier.  This  arrangement  allows  more  data  to  be  used  in  building  the  classifiers, 
and  hence  allows  more  accurate  classifiers  to  be  obtained.  In  the  testing  stage,  a  given  testing 
data  point  is  submitted  to  that  classifier  that  fully  exploits  the  fine -resolution  features  that  the 
data  point  possesses. 
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